CGDoom/src-cg/cgdoom-asm.s

.global _CGD_sector_memcmp
.align 4

# A pretty fast memcmp for 512-byte sectors, with equal(0)-different(1) output
# r4: 32-aligned pointer to sector in RAM (preferably 1-cycle operand bus RAM)
# r5: 32-aligned pointer to sector in ROM
# r6: 512 (ignored; for compatibility with memcmp prototype)
#
# There are two main ideas in this code:
#
# * Read with words, since such is the affinity of the ROM. (I don't know why.)
#   I tested with longwords, the performance is much worse; bytes are somewhere
#   in-between, which tormented me as I wondered why the most trivial memcmp()
#   with poor assembler from libfxcg was faster than my hand-written function.
#
# * Weave iterations with smart register allocation to exploit superscalar
#   parallelism. We read to r0/r1 while comparing r2/r3, then vice-versa. The
#   two mov.w (LS) for one comparison execute in parallel with the cmp (EX) and
#   bf (BR) of the previous comparison, so overall one comparison takes 2
#   cycles (plus any extra cycles in ROM reads if the cache isn't hit or
#   doesn't respond immediately, and some loop overhead).
#
_CGD_sector_memcmp:
	# For the first 32 bytes, compare as fast as possible to exit early
	# when the sectors don't match (this saves a little bit).
	mov	#16, r7
1:	mov.w	@r5+, r0
	mov.w	@r4+, r1
	cmp/eq	r0, r1
	bf	.fail
	dt	r7
	bf	1b

	mov	#30, r7

.line:
	# There is a 2-cycle delay for the RAW dependency between each mov.w
	# and the corresponding use. Here the delay is honored so there are no
	# cycles lost to RAW dependencies.

	mov.w	@r5+, r0
	nop

	mov.w	@r4+, r1
	nop

	mov.w	@r5+, r2
	nop

	mov.w	@r4+, r3
	cmp/eq	r0, r1

	mov.w	@r5+, r0
	bf	.fail

	mov.w	@r4+, r1
	cmp/eq	r2, r3

	mov.w	@r5+, r2
	bf	.fail

	mov.w	@r4+, r3
	cmp/eq	r0, r1

	mov.w	@r5+, r0
	bf	.fail

	mov.w	@r4+, r1
	cmp/eq	r2, r3

	mov.w	@r5+, r2
	bf	.fail

	mov.w	@r4+, r3
	cmp/eq	r0, r1

	mov.w	@r5+, r0
	bf	.fail

	mov.w	@r4+, r1
	cmp/eq	r2, r3

	mov.w	@r5+, r2
	bf	.fail

	mov.w	@r4+, r3
	cmp/eq	r0, r1

	# These two can run in parallel (BR/EX)
	bf	.fail
	cmp/eq	r2, r3

	bf	.fail

	dt	r7
	bf	.line

.success:
	rts
	mov	#0, r0

.fail:
	# Unlike memcmp(), no ordering is reported; any mismatch returns 1
	rts
	mov	#1, r0
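
For reference, the contract of the routine above can be modelled in portable C. This is only a sketch of the observable behaviour (16-bit reads, 0 when the sectors are equal, 1 when they differ, no ordering), not the author's code and not a replacement for the hand-scheduled assembly. The name `sector_differs_ref` is hypothetical. The loop bounds mirror the assembly: 16 words in the early-exit prologue, then 30 iterations of 8 words each, for 256 words (512 bytes) in total.

```c
#include <stdint.h>

#define SECTOR_WORDS   256  /* 512 bytes read as 16-bit words */
#define PROLOGUE_WORDS 16   /* first 32 bytes, checked by the simple loop */
#define LINE_WORDS     8    /* words compared per unrolled .line iteration */
#define LINE_COUNT     30   /* 30 * 8 + 16 = 256 words */

/* Hypothetical reference model: returns 0 if both sectors are equal and 1
   otherwise. Unlike memcmp(), the result carries no ordering. */
int sector_differs_ref(const uint16_t *ram, const uint16_t *rom)
{
    int i = 0;

    /* Early-exit prologue: most non-matching sectors differ here. */
    for (; i < PROLOGUE_WORDS; i++) {
        if (ram[i] != rom[i])
            return 1;
    }

    /* Main body: the assembly unrolls each group of 8 words by hand. */
    for (int line = 0; line < LINE_COUNT; line++) {
        for (int w = 0; w < LINE_WORDS; w++, i++) {
            if (ram[i] != rom[i])
                return 1;
        }
    }
    return 0;
}
```

A model like this is mainly useful on a host machine, to check that a rewritten assembly routine still agrees with the expected 0/1 result on sample sectors.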

Optimize loading speed (x2.7) and game speed (+35%)

Loading is measured by RTC_GetTicks().

* Initial version: 9.8s
  This was a regression due to using 512-byte sectors instead of 4 kiB
  clusters as previously.

* Do BFile reads of 4 kiB: 5.2s (-47%)
  Feels similar to original code, I'll take this as my baseline.

* Test second half of Flash first: 3.6s (-31%)
  By reading from FLASH_FS_HINT to FLASH_END first many OS sectors can be
  skipped (without missing on other sectors just in case).

* Load to XRAM instead of RAM with BFile
  The DMA is 10% slower to XRAM than to RAM, but this benefits memcmp()
  because of faster memory accesses through the operand bus. No effect at
  this point, but ends up saving 8% after memcmp is optimized.

* Optimize memcmp for sectors: 3376 ms (-8%)
  The optimized memcmp uses word accesses for ROM (which is fastest), and
  weaves loop iterations to exploit superscalar parallelism.

* Search sectors most likely to contain data first: 2744 ms (-19%)
  File fragments almost always start on 4-kiB boundaries between
  FLASH_FS_HINT and FLASH_END, so these are tested first.

* Index most likely sectors, improve FLASH_FS_HINT: 2096 ms (-24%)
  Most likely sectors are indexed by first 4 bytes and binary searched, and
  a slightly larger region is considered for hints. The cache hits 119/129
  fragments in my case.

* Use optimized memcmp for consecutive fragments: 1408 ms (-33%)
  I only set it for the search of the first sector in each fragment and
  forgot to use it where it is really needed. x)

Game speed is measured roughly by the time it takes to hit a wall by
walking straight after spawning in Hangar.

* Initial value: 4.4s

* Use cached ROM when loading data from the WAD: 2.9s (-35%)
  Cached accesses are quite detrimental for sector search, I assume because
  everything is aligned like crazy, but it's still a major help when
  reading sequential data in real-time.

2021-07-28 22:51:03 +02:00
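
The commit message mentions indexing the most likely sectors by their first 4 bytes and binary searching that index. A minimal sketch of that idea follows, assuming a sorted array of (key, pointer) entries; the types and the names `index_entry_t`, `index_build` and `index_find` are hypothetical and are not taken from the CGDoom sources. Candidates that share the same 4-byte key are confirmed with a full comparison, which on the calculator would presumably be the optimized sector routine rather than `memcmp()`.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical index entry: first 4 bytes of a candidate sector and its address. */
typedef struct {
    uint32_t key;
    const void *sector;
} index_entry_t;

/* Read the first 4 bytes of a sector as the index key. */
static uint32_t first4(const void *sector)
{
    uint32_t k;
    memcpy(&k, sector, sizeof k);
    return k;
}

/* qsort() comparator ordering entries by key. */
static int cmp_entries(const void *pa, const void *pb)
{
    const index_entry_t *a = pa, *b = pb;
    return (a->key > b->key) - (a->key < b->key);
}

/* Fill the index from `count` candidate sectors and sort it by key. */
void index_build(index_entry_t *index, const void *const *sectors, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        index[i].key = first4(sectors[i]);
        index[i].sector = sectors[i];
    }
    qsort(index, count, sizeof *index, cmp_entries);
}

/* Binary search for the first entry with the wanted key, then confirm with a
   full comparison. Returns the matching candidate, or NULL on a miss. */
const void *index_find(const index_entry_t *index, size_t count,
                       const void *wanted, size_t sector_size)
{
    uint32_t key = first4(wanted);
    size_t lo = 0, hi = count;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    for (size_t i = lo; i < count && index[i].key == key; i++) {
        if (memcmp(index[i].sector, wanted, sector_size) == 0)
            return index[i].sector;
    }
    return NULL;
}
```

On a miss, the caller would fall back to the slower linear scan over the remaining sectors, which matches the commit message's remark that the cache hits most but not all fragments.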