.global _gint_image_p8_loop

/* gint's image renderer: 8-bit indexed entry point

   P8 compacts images by indexing each pixel on a 256-color palette, thus
   halving the amount of data per pixel. This comes at the cost of an
   additional lookup during rendering.

   For this format, there is no way to bundle pixels together, and the more
   advanced loops handle pixels individually with a 2-unrolled
   2-stage-pipeline structure to accelerate the CPU processing when that is
   the bottleneck (which often means when there are transparent pixels to
   skip).

   For readers not familiar with loop optimization literature, the main idea
   is that a simple loop which loads a pixel, processes it, and writes it, is
   too inefficient because of read-after-write (RAW) delays. To use the full
   speed of the CPU, one needs to do more work in parallel and spread out
   actions on a single pixel, which we do here with two loop transforms:

   * _Pipelining_ the loop consists in handling a single pixel over several
     iterations by doing a little bit of work in each iteration. The data for
     the pixel would move from register to register at each iteration, with
     the loop code doing one stage's worth of computation on each register.
     This gives us more pixels to work on simultaneously, and more independent
     work means fewer RAW limitations. Loops in this renderer have 2 stages at
     most.

   * _Unrolling_ iterations of the loop consists in loading two (or more)
     pixels at the start of each iteration so that we can work on one while
     waiting for stalls and dependencies on the other. Unlike pipelining,
     pixels are still confined within iterations. Non-trivial loops in this
     renderer process 2 pixels per iteration.

   Unrolling has one major flaw: handling pairs of pixels only works if the
   total number of pixels to draw is even. The usual way to handle this for n
   pixels is to do ⌊n/2⌋ iterations and handle the last pixel individually if
   n is odd.
   This is extremely annoying, since every row must check the value of n, and
   an extra copy of the loop code for a single pixel must be maintained on the
   side, which takes more space and more effort. However, we have a
   specialized solution here with *edge pixels*.

   The idea of edge pixels is to round the number of pixels *up* and perform
   ⌊(n+1)/2⌋ runs of the inner loop. If n is odd, this will overwrite a single
   pixel at the end of the line. We can cancel this error after the fact by
   saving the value of the (n+1)-th pixel of the line before the loop, and
   restoring it afterwards. Note that if n is even then the save/restore is a
   no-op.

   This takes some caution however, as the temporary overwrite could be seen
   by an interrupt. Some measures are put in place to reserve a couple of
   bytes on each side of gint's VRAM and Azur's target fragment to avoid any
   problems.

   r0:  - (initially: cmd.effect)
   r1:  Number of lines remaining to draw
   r2:  Number of columns per line
   r3:  Input pointer
   r4:  Input stride
   r5:  Output pointer
   r6:  Output stride
   r7:  Right edge or [temporary]
   r8:  - (initially: cmd)
   r9:  - (initially: cmd.loop) */

_gint_image_p8_loop:
    /* r4: int output_width (pixels)
       r5: struct gint_image_cmd *cmd */

    mov.b   @(1,r5), r0     /* cmd.effect */
    add     #2, r5

    mov.l   r8, @-r15
    mov     r4, r6

    mov.w   @r5+, r2        /* cmd.columns */
    mov     r5, r8          /* From here on the command is r8 */

    mov.l   r9, @-r15
    shlr    r0              /* T bit is now VFLIP */

    mov.w   @r8+, r4        /* cmd.input_stride */
    sub     r2, r6

    mov.b   @r8+, r1        /* cmd.lines */
    add     r6, r6

    mov.b   @r8+, r9        /* cmd.edge_1 - don't care */
    nop

    mov.l   @r8+, r9        /* cmd.loop */
    extu.b  r1, r1

    mov.l   @r8+, r5        /* cmd.output */
    nop

    bf.s    _NO_VFLIP
     mov.l  @r8+, r3        /* cmd.input */

_VFLIP:
    neg     r4, r4
    nop

_NO_VFLIP:
    jmp     @r9
     sub    r2, r4
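
/* Illustration only, not part of gint: the edge-pixel trick described in the
   header comment can be sketched in C as below. The function name
   draw_row_p8, its signature, and the 16-bit output pixel type are all
   hypothetical and chosen for this example. The sketch assumes that in[n]
   and out[n] are addressable even when they lie just past the row, which is
   the role of the padding bytes mentioned above. It shows only the rounding
   and the save/restore, not the pipelining or the transparency handling of
   the real loops.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: draw one row of n palette-indexed pixels, two pixels
// per iteration, with no separate code path for an odd n.
static void draw_row_p8(uint16_t *out, const uint8_t *in,
                        const uint16_t *palette, size_t n)
{
    // Assumption: the caller always passes valid buffers.
    assert(out != NULL && in != NULL && palette != NULL);

    // Round the pixel count up to a whole number of pairs.
    size_t pairs = (n + 1) / 2;

    // Save the (n+1)-th output pixel; the loop clobbers it when n is odd.
    uint16_t edge = out[n];

    for (size_t i = 0; i < pairs; i++) {
        // Load both indices first so the two lookups are independent.
        uint8_t a = in[2 * i];
        uint8_t b = in[2 * i + 1];
        out[2 * i] = palette[a];
        out[2 * i + 1] = palette[b];
    }

    // Restore the edge pixel; when n is even this rewrites the same value.
    out[n] = edge;
}
```

   Calling draw_row_p8(out, in, palette, 3) runs two iterations, briefly
   overwrites out[3], then restores it, so the caller sees exactly three
   pixels drawn. */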