This occurs if all texture IDs on a particular line are multiples of
256 (like STARGR2 = 256) since the cast to boolean (byte) destroys
significant bits.
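To illustrate the truncation, here is a minimal sketch. It assumes the boolean type is a one-byte typedef (the name byte_boolean and both helper functions are hypothetical, not taken from the game code):

```c
#include <stdint.h>

/* Hypothetical stand-in for a one-byte boolean typedef, as opposed to
   C99 _Bool (which would convert any non-zero value to 1). */
typedef uint8_t byte_boolean;

/* Buggy: only the low 8 bits of the ID survive the conversion, so any
   multiple of 256 becomes 0 (false). */
static byte_boolean has_texture_truncating(int texture_id)
{
    return (byte_boolean)texture_id;
}

/* Safer: compare first, so the result is exactly 0 or 1. */
static byte_boolean has_texture(int texture_id)
{
    return (byte_boolean)(texture_id != 0);
}
```

With STARGR2 = 256, the truncating version reports "no texture" while the comparing version does not.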
Storing cumulative lengths will allow a binary search for faster lookup.
This is important because performance seems to vary wildly with the
number of fragments, which I suspect is due to the linear search (there
are often several hundred fragments).
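A rough sketch of the idea, assuming each fragment is described only by its length (the function names and array layout are made up for illustration and are not the actual WAD access code):

```c
#include <stddef.h>

/* cumul[i] = total length of fragments 0..i inclusive, so the array is
   strictly increasing as long as no fragment is empty. */
static void accumulate_lengths(const size_t *lengths, size_t *cumul, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) {
        total += lengths[i];
        cumul[i] = total;
    }
}

/* Returns the index of the fragment containing file offset `offset`,
   or -1 if the offset is past the end of the file. O(log n) steps. */
static int find_fragment(const size_t *cumul, size_t n, size_t offset)
{
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cumul[mid] <= offset)
            lo = mid + 1;   /* offset lies after fragment mid */
        else
            hi = mid;       /* offset lies in fragment mid or before */
    }
    return (lo < n) ? (int)lo : -1;
}
```

The offset inside the fragment is then offset minus cumul[i-1] (or just offset when i is 0).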
Performance had dropped from 2.9s to 3.5s since the last test, and
surprisingly this change brings it back to 2.9s, even though there are
now more fragments (150) than during the first test (100).
I suspect binary search will improve performance again. This would be
very helpful, as it would prove that WAD access is the primary
bottleneck for the game. Unlike actual game code, WAD access is
something we can look at and even optimize.
This change adds proper key control by querying the KEYSC directly
instead of using PRGM_GetKey(). This has the distinct advantage of
allowing multiple keys to be pressed at once.
Controls are still quite hard to use; I'll think of an alternative
keymap.
lumpinfo is now allocated with Z_Malloc() because this is needed for
some larger WADs.
More heap is needed to compensate and to support larger WADs fully, so
the unused part of the user stack is added as a second zone.
This makes at least the start of the Ultimate DOOM WAD playable.
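As an illustration of the second-zone fallback, here is a deliberately simplified sketch. It uses a bump allocator, not DOOM's actual Z_Malloc (which also handles freeing, tags and purging); zone_t, zone_alloc() and multi_zone_alloc() are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* One contiguous memory region handed to the allocator. */
typedef struct {
    uint8_t *base;  /* start of the region (assumed 4-byte aligned) */
    size_t size;    /* total bytes available */
    size_t used;    /* bytes already handed out */
} zone_t;

/* Bump-allocate from a single zone; returns NULL if it does not fit. */
static void *zone_alloc(zone_t *z, size_t size)
{
    if (size > SIZE_MAX - 3)
        return NULL;
    size = (size + 3) & ~(size_t)3;   /* round up to a multiple of 4 */
    if (size > z->size - z->used)
        return NULL;
    void *p = z->base + z->used;
    z->used += size;
    return p;
}

/* Try each zone in order, so the second zone only serves requests
   that the first one cannot satisfy. */
static void *multi_zone_alloc(zone_t *zones, int count, size_t size)
{
    for (int i = 0; i < count; i++) {
        void *p = zone_alloc(&zones[i], size);
        if (p)
            return p;
    }
    return NULL;
}
```

Here the first zone would be the regular heap and the second one the unused part of the user stack.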
The bar takes up a little bit of time too, but I think it's a plus.
Currently it's limited to ~20 frames, which normally costs < 0.3s.
Drawing a frame for every fragment is disastrous in comparison (loading
time x3 lol).
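One way to cap the number of redraws is sketched below. This is only an assumption about how such a limit could be done; should_redraw() is a hypothetical helper, not the function used in the loader:

```c
/* Returns 1 if the progress bar should be redrawn after item number
   `done` (counted from 1) out of `total` has been processed, so that
   at most `max_frames` redraws happen over the whole load. */
static int should_redraw(int done, int total, int max_frames)
{
    if (total <= 0 || max_frames <= 0)
        return 0;
    /* The bar advances in max_frames discrete steps; redraw only when
       this item moves it to the next step. */
    int before = (done - 1) * max_frames / total;
    int after = done * max_frames / total;
    return after != before;
}
```

With 150 fragments and a cap of 20, this redraws 20 times instead of 150.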
This was using screens[1] which I had deallocated when fixing the status
bar (I incorrectly assumed it was used only for that).
While the CGDOOM technique to share screens[1] to avoid allocating the
320x20 buffer for the status bar makes clear sense with that new
information, I think I'll keep this 6.4 kB buffer there and rather
search for ways to use more memory zones.
Loading is measured by RTC_GetTicks().
* Initial version: 9.8s
This was a regression due to using 512-byte sectors instead of 4 kiB
clusters as previously.
* Do BFile reads of 4 kiB: 5.2s (-47%)
Feels similar to original code, I'll take this as my baseline.
* Test second half of Flash first: 3.6s (-31%)
By searching from FLASH_FS_HINT to FLASH_END first, many OS sectors
can be skipped (the other sectors are still searched afterwards, just
in case).
* Load to XRAM instead of RAM with BFile
The DMA is 10% slower to XRAM than to RAM, but this benefits memcmp()
because of faster memory accesses through the operand bus. No effect
at this point, but ends up saving 8% after memcmp is optimized.
* Optimize memcmp for sectors: 3376 ms (-8%)
The optimized memcmp uses word accesses for ROM (which is fastest),
and weaves loop iterations to exploit superscalar parallelism.
* Search sectors most likely to contain data first: 2744 ms (-19%)
File fragments almost always start on 4-kiB boundaries between
FLASH_FS_HINT and FLASH_END, so these are tested first.
* Index most likely sectors, improve FLASH_FS_HINT: 2096 ms (-24%)
Most likely sectors are indexed by first 4 bytes and binary searched,
and a slightly larger region is considered for hints. The cache hits
119/129 fragments in my case.
* Use optimized memcmp for consecutive fragments: 1408 ms (-33%)
I had only used it for the search of the first sector in each fragment
and forgot to use it where it is really needed. x)
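The "index by first 4 bytes" step can be sketched as follows. This is a simplified illustration under my own assumptions (candidate sectors are contiguous, and the names index_entry_t, build_index() and find_sector() are hypothetical); it uses the standard qsort() and memcmp() rather than the optimized versions mentioned above:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    uint32_t key;           /* first 4 bytes of the sector */
    const uint8_t *sector;  /* where that sector lives in Flash */
} index_entry_t;

static uint32_t first_word(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);  /* avoids alignment assumptions */
    return w;
}

static int compare_entries(const void *a, const void *b)
{
    uint32_t ka = ((const index_entry_t *)a)->key;
    uint32_t kb = ((const index_entry_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Index `count` candidate sectors starting at `base`, sorted by key. */
static void build_index(index_entry_t *index, const uint8_t *base,
                        size_t sector_size, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        index[i].sector = base + i * sector_size;
        index[i].key = first_word(index[i].sector);
    }
    qsort(index, count, sizeof *index, compare_entries);
}

/* Binary search for the first entry with a matching key, then fully
   compare every candidate sharing that key. Returns NULL on a miss, in
   which case the caller falls back to the full search. */
static const uint8_t *find_sector(const index_entry_t *index, size_t count,
                                  const uint8_t *data, size_t sector_size)
{
    uint32_t key = first_word(data);
    size_t lo = 0, hi = count;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid].key < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    for (; lo < count && index[lo].key == key; lo++) {
        if (memcmp(index[lo].sector, data, sector_size) == 0)
            return index[lo].sector;
    }
    return NULL;
}
```

Several sectors can share the same first 4 bytes, which is why the lookup finds the lower bound and then walks all entries with an equal key.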
Game speed is measured roughly by the time it takes to hit a wall by
walking straight after spawning in Hangar.
* Initial value: 4.4s
* Use cached ROM when loading data from the WAD: 2.9s (-35%)
Cached accesses are quite detrimental for sector search, I assume
because everything is aligned like crazy, but it's still a major help
when reading sequential data in real-time.