This optimization, suggested by TSWilliamson, pushes not only RAM but also on-chip memory and the CPU pipeline to their limits.
This change adds optimized versions of the core memory functions. They rely on 4-byte alignment, 2-byte alignment, and the SH4's unaligned move instruction to (hopefully) attain good performance in all situations.
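
For illustration only, the alignment-based dispatch could look roughly like the portable C sketch below. It is not the code added by this change: the name sketch_memcpy and the load/store helpers are hypothetical, and the fixed-size memcpy calls are stand-ins for the single move instructions (including the unaligned move) that a hand-written SH4 version would use.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: portable stand-ins for single load/store
   instructions. Compilers typically lower a fixed-size memcpy to one move. */
static inline uint32_t load32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static inline void store32(uint8_t *p, uint32_t v) { memcpy(p, &v, 4); }
static inline uint16_t load16(const uint8_t *p) { uint16_t v; memcpy(&v, p, 2); return v; }
static inline void store16(uint8_t *p, uint16_t v) { memcpy(p, &v, 2); }

/* Hypothetical name; a sketch of the dispatch idea, not the real routine. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Head: copy single bytes until the destination is 4-aligned. */
    while (n > 0 && ((uintptr_t)d & 3) != 0) {
        *d++ = *s++;
        n--;
    }

    /* Body: pick a strategy from the source alignment that remains. */
    switch ((uintptr_t)s & 3) {
    case 0:
        /* Source and destination both 4-aligned: plain word moves. */
        for (; n >= 4; n -= 4, d += 4, s += 4)
            store32(d, load32(s));
        break;
    case 2:
        /* Source only 2-aligned: halfword moves. */
        for (; n >= 2; n -= 2, d += 2, s += 2)
            store16(d, load16(s));
        break;
    default:
        /* Odd source address: on the target this load would be the
           unaligned move instruction; the store stays 4-aligned. */
        for (; n >= 4; n -= 4, d += 4, s += 4)
            store32(d, load32(s));
        break;
    }

    /* Tail: whatever bytes are left over. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

Aligning the destination first means every wide store is aligned, so only the loads need a special case per source alignment.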