Lephenixnoir edited this page 2 years ago

bopti on fx-9860G

The bitmap drawing module, bopti, renders images using direct bitwise operations on video RAM (VRAM) longwords. This method relies heavily on the 4-byte alignment of gint's VRAM to operate on 32 pixels at a time and avoid costly single-bit operations.

In gint's development workflow, images in usual formats are first converted to the bopti format at compile-time. The bopti format is designed for fast rendering: it consists of one or several monochrome bitmaps called layers, arranged in a fixed combination called a profile. Each profile has a matching assembler routine designed to render the image quickly.

Performance

(TODO)

Probably about 15 times as fast as MonochromeLib.

Color profiles

When converting an image, the fxconv tool of the fxSDK first quantizes the colors by mapping transparent pixels to alpha and every other pixel to the closest of these four colors:

Color name   Hexadecimal
black        #000000
dark         #555555
light        #aaaaaa
white        #ffffff
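
The quantization step can be sketched as follows. This is only an illustration: it assumes the pixel has already been reduced to an 8-bit gray level plus a transparency flag, and it measures closeness as the absolute difference of gray levels. The names quantize and enum color are hypothetical, and the actual distance used by fxconv is not described on this page.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical color codes, for illustration only */
enum color { C_WHITE, C_LIGHT, C_DARK, C_BLACK, C_ALPHA };

/* Quantizes one pixel given as an 8-bit gray level and a transparency
   flag. Transparent pixels map to alpha; every other pixel maps to the
   closest of the four levels 0xff, 0xaa, 0x55 and 0x00. */
enum color quantize(uint8_t gray_level, int transparent)
{
    /* Indexed by enum color: white, light, dark, black */
    static const uint8_t levels[4] = { 0xff, 0xaa, 0x55, 0x00 };

    if(transparent) return C_ALPHA;

    int best = 0;
    for(int i = 1; i < 4; i++)
    {
        if(abs(gray_level - levels[i]) < abs(gray_level - levels[best]))
            best = i;
    }
    return (enum color)best;
}
```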

Then the image is assigned the smallest profile that can represent all of its colors:

Profile      Supported colors
mono         black, white
mono_alpha   black, white, alpha
gray         black, white, light, dark
gray_alpha   black, white, light, dark, alpha
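
Choosing the profile then only depends on whether the quantized image uses a gray color and whether it uses alpha. A minimal sketch, where enum profile and smallest_profile are hypothetical names and the numeric values do not necessarily match the profile identifiers used by fxconv:

```c
#include <stdbool.h>

/* Hypothetical identifiers, ordered from smallest to largest profile */
enum profile { P_MONO, P_MONO_ALPHA, P_GRAY, P_GRAY_ALPHA };

/* Returns the smallest profile able to represent the image, given
   whether any pixel is light or dark, and whether any pixel is alpha */
enum profile smallest_profile(bool uses_gray, bool uses_alpha)
{
    if(uses_gray) return uses_alpha ? P_GRAY_ALPHA : P_GRAY;
    return uses_alpha ? P_MONO_ALPHA : P_MONO;
}
```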

Layers

Each profile has a fixed number of layers with a predefined meaning. During rendering, all of the layers are blitted in order to produce the image. The number of layers in a profile is always minimal: it is ⌈log₂ n⌉ where n is the number of colors, that is, 1 layer for 2 colors, 2 layers for 3 or 4 colors, and 3 layers for 5 colors.

On fx-9860G, the VRAM is either monochrome or 4-color gray, so pixel colors can only take 2 or 4 different values. This makes logical operations the natural way to implement blitting, because they extend effortlessly to multiple pixels at once.

The current version of bopti uses the following types of layers:

Layer name   Category     Effect for 0-bits    Effect for 1-bits
fill         Monochrome   Paints white         Paints black
white        Monochrome   -                    Paints white
black        Monochrome   -                    Paints black
lfill        Gray         Clears light VRAM    Paints light VRAM
dfill        Gray         Clears dark VRAM     Paints dark VRAM
light        Gray         -                    Paints light gray
dark         Gray         -                    Paints dark gray

When performing an operation, bopti takes data from the encoded image and applies bitwise operations for all layers. It then moves to a different part of the image. The previous version of bopti applied each layer independently, but the current version applies them all at once, saving even more time.

Note that most functions do nothing on 0-bits; this is an optimization related to rectangle masks. When a VRAM longword is loaded to a register, often the blitted image will not cover it entirely. The pixels that must be preserved are represented in a structure called a rectangle mask. Having this neutral 0-bit makes it simple to preserve relevant pixels while drawing the image by setting the corresponding rectangle mask bits to 0. When layers don't have this preserving 0-bit, masks must instead be applied manually. See later for more details.
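
The difference between layers with and without a neutral 0-bit can be illustrated in C. This is a sketch rather than the assembler code used by bopti: the function names are hypothetical, and a mask bit of 1 is taken to mean "this pixel may be modified", consistent with the paragraph above.

```c
#include <stdint.h>

/* black layer: 0-bits are neutral, so clearing the data bits outside of
   the mask is enough to preserve the corresponding VRAM pixels */
uint32_t blit_black(uint32_t vram, uint32_t data, uint32_t mask)
{
    return vram | (data & mask);
}

/* white layer: same idea, 0-bits leave the VRAM untouched */
uint32_t blit_white(uint32_t vram, uint32_t data, uint32_t mask)
{
    return vram & ~(data & mask);
}

/* fill layer: both 0-bits and 1-bits paint, so the mask must be applied
   manually by keeping the VRAM bits outside of it */
uint32_t blit_fill(uint32_t vram, uint32_t data, uint32_t mask)
{
    return (vram & ~mask) | (data & mask);
}
```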

Here is the relationship between color profiles and their layers:

  • The mono profile only has a fill layer.
  • The mono_alpha profile starts with a white layer to clear the non-transparent region of the image, then blits a black layer to render the content.
  • The gray profile has an lfill and a dfill layer. These two types of layer act on different VRAMs.
  • The gray_alpha profile starts by blitting a white layer on both VRAMs, then adds a light layer and a dark layer.
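
The list above can be summarized as a small table in C. The structure and names are hypothetical and only restate the text; bits_needed() computes the smallest number of bits able to tell n values apart, which is the number of layers of each profile.

```c
/* Hypothetical summary of the four profiles described above */
struct profile_info
{
    const char *name;
    int colors;                   /* Number of supported colors */
    int layers;                   /* Number of layers */
    const char *layer_names[3];   /* Layers, in rendering order */
};

static const struct profile_info profiles[4] = {
    { "mono",       2, 1, { "fill" } },
    { "mono_alpha", 3, 2, { "white", "black" } },
    { "gray",       4, 2, { "lfill", "dfill" } },
    { "gray_alpha", 5, 3, { "white", "light", "dark" } },
};

/* Smallest number of bits able to tell n values apart */
int bits_needed(int n)
{
    int bits = 0;
    while((1 << bits) < n) bits++;
    return bits;
}
```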

Logical operations on pixels

As a reference, here are the logical operations used to blit layers on past and present versions of bopti. The x parameter is a boolean; the transformation must happen iff x=1. The significance of x appears when extending the logical operations to a longword: it allows controlling 32 pixels individually while still using only a couple of logical instructions.

black  (data, x) = data | x
white  (data, x) = data & ~x
invert (data, x) = data ^ x
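
Extended to longwords, these three operations translate directly to C. The op_* names are hypothetical; each bit of x selects whether the matching pixel of data is affected.

```c
#include <stdint.h>

/* Each function transforms the 32 pixels of data at once; only the
   pixels whose bit is set in x are modified */
uint32_t op_black(uint32_t data, uint32_t x)  { return data | x; }
uint32_t op_white(uint32_t data, uint32_t x)  { return data & ~x; }
uint32_t op_invert(uint32_t data, uint32_t x) { return data ^ x; }
```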

For gray images, we need to know that the gray engine produces an illusion of intermediate color by quickly alternating two buffers on the screen, with a different duration for each. This way, the proportion of time each pixel is black is one of four different values. Assuming long and short represent the value of a pixel in the VRAMs that respectively stay longer and shorter on the screen, we have the following encoding:

white     = 0 (long=0 short=0)
lightgray = 1 (long=0 short=1)
darkgray  = 2 (long=1 short=0)
black     = 3 (long=1 short=1)

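So operations on gray pixels will modify two VRAMs. In the functions below, light denotes the value in the short VRAM and dark the value in the long VRAM, so that a light gray pixel has light=1 and dark=0.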

Among interesting operations, we have lighten, which shifts all values towards white (and white remains white), as if decrementing them, and darken, which shifts all values towards black (and black remains black), as if incrementing them.

black   (light, dark, x) = (light | x, dark | x)
dark    (light, dark, x) = (light & ~x, dark | x)
light   (light, dark, x) = (light | x, dark & ~x)
white   (light, dark, x) = (light & ~x, dark & ~x)
inverse (light, dark, x) = (light ^ x, dark ^ x)
lighten (light, dark, x) = ((light ^ x) & (dark | ~x), dark & (light | ~x))
darken  (light, dark, x) = ((light ^ x) | (dark & x), dark | (light & x))

These functions are obtained by staring at a truth table, then adding a linear number of x's to neutralize some operands when x=0.
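
The lighten and darken formulas can be checked against the encoding with a few lines of C. This is a sketch with hypothetical names; it assumes that dark holds the bit of the long VRAM and light the bit of the short VRAM, as the definitions of the light and dark operations above imply.

```c
#include <stdint.h>

/* One longword from the light VRAM and one from the dark VRAM */
struct gray { uint32_t light, dark; };

struct gray op_lighten(struct gray v, uint32_t x)
{
    struct gray r;
    r.light = (v.light ^ x) & (v.dark | ~x);
    r.dark  = v.dark & (v.light | ~x);
    return r;
}

struct gray op_darken(struct gray v, uint32_t x)
{
    struct gray r;
    r.light = (v.light ^ x) | (v.dark & x);
    r.dark  = v.dark | (v.light & x);
    return r;
}

/* Color of the pixel stored at the given bit, from 0 (white) to 3
   (black), with the dark VRAM providing the high-order bit */
int gray_value(struct gray v, int bit)
{
    return (int)((((v.dark >> bit) & 1) << 1) | ((v.light >> bit) & 1));
}
```

With light = 0xA and dark = 0xC, bits 0 to 3 hold the colors 0, 1, 2 and 3; lightening them with x = 0xF yields 0, 0, 1, 2 and darkening them yields 1, 2, 3, 3.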

Assembler-driven rendering

The previous implementation of bopti was already fast, usually about 8 times as fast as MonochromeLib. Half of the speedup was due to VRAM alignment, and the other half was related to implementation and format. It had, however, two limiting factors:

  1. The operation function was a generic function taking the color as argument, and it used a switch to decide which operation to apply;
  2. Each layer was drawn independently, so the 2D structure of the image was unnecessarily traversed several times.

These two limitations are related and can be overcome by specializing the rendering code, which is the innermost part of the critical loop. The current version of bopti has one specialized rendering function per color profile, implemented in assembler, which applies all the layers of the profile in a single pass.

Image format

The conversion is performed by fxconv at compile-time and outputs a big-endian data structure that can be efficiently traversed from the add-in.

The image is first extended to make its width a multiple of 32 pixels, then stored in row-major order:

   (32)     (32)     (32)
+--------+--------+--------+
|    1   |    2   |    3   |  (1)
+--------+--------+--------+
|    4   |    5   |    6   |  (1)
+--------+--------+--------+

A set of 32 pixels as numbered on the diagram above is called a position. This is an important concept for the rendering algorithm. For each position, the data of all layers is stored in rendering order, so the layers are interwoven in the storage. It also means that the data for a position will consist of several longwords, not just one.
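
As an illustration of this layout, the location of a position's data can be computed as follows. The helper names are hypothetical, and the sketch assumes one longword per layer per position, with rows and columns counted from 0.

```c
#include <stddef.h>

/* Number of 32-pixel columns once the width is extended to a multiple
   of 32 pixels */
int columns(int width)
{
    return (width + 31) >> 5;
}

/* Offset, in longwords from the start of the layer data, of the position
   at the given row and column. All the layers of a position are stored
   together, so each position occupies `layers` consecutive longwords. */
size_t position_offset(int width, int layers, int row, int column)
{
    return ((size_t)row * columns(width) + column) * layers;
}
```

For a 96-pixel wide image with 2 layers, the position numbered 5 on the diagram (row 1, column 1) starts at longword 8.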

Note that extending the image to a multiple of 32 in width is not a hard requirement. It could be avoided by defining and implementing 16-bit and 8-bit positions, but this is currently not done.

Along with this data, the image object contains a number of attributes:

typedef struct
{
  /* Image can only be rendered with the gray engine */
  uint gray     :1;
  /* Left for future use */
  uint          :3;
  /* Image profile (uniquely identifies a rendering function) */
  uint profile  :4;
  /* Full width, in pixels */
  uint width    :12;
  /* Full height, in pixels */
  uint height   :12;

  /* Raw layer data */
  uint8_t data[];

} GPACKED(4) image_t;

The first byte indicates the color profile and whether this profile is gray-only. width and height are the natural dimensions of the image, before width extension (which is only relevant for storage). The number of columns is deduced from the width.
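
To make the layout of these attributes concrete, here is a sketch that decodes them from the first four bytes of an image. It assumes that bit-fields are allocated starting from the most significant bit, as GCC does on big-endian targets, which is consistent with the first byte holding the gray flag and the profile. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Decoded attributes; hypothetical type used for illustration */
struct image_header { int gray, profile, width, height; };

struct image_header decode_header(const uint8_t bytes[4])
{
    /* Assemble the big-endian longword */
    uint32_t w = ((uint32_t)bytes[0] << 24) | ((uint32_t)bytes[1] << 16)
               | ((uint32_t)bytes[2] << 8)  | (uint32_t)bytes[3];

    struct image_header h;
    h.gray    = (w >> 31) & 0x1;    /* 1 bit, then 3 unused bits */
    h.profile = (w >> 24) & 0xf;    /* 4 bits */
    h.width   = (w >> 12) & 0xfff;  /* 12 bits */
    h.height  = w & 0xfff;          /* 12 bits */
    return h;
}
```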

Rendering algorithm

The rendering algorithm takes as parameter a subrectangle of an image and a target position on the VRAM. Drawing a subrectangle instead of the whole image makes it trivial to do clipping by just cutting whatever goes beyond the screen out of the source area.

Two functions are available at this level:

  • bopti_render_clip() clips the provided subrectangle to the image dimensions, then clips that to the screen, and renders. This is the default but all the checks take some time to perform.
  • bopti_render_noclip() renders directly, assuming that the subrectangle is valid and that the render fully fits into the VRAM. In many situations these assumptions are known to hold, so it can be selected by passing DIMAGE_NOCLIP to dsubimage() to save time.
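
The clipping performed in the first case can be sketched along one axis as follows. This is not the code of bopti_render_clip(), only an illustration with hypothetical names; the same helper would be called once for x (with the image width and the 128-pixel screen width) and once for y (with the image height and the 64-pixel screen height).

```c
/* Clips along one axis. *src and *len describe the source range in the
   image, *dst is the target coordinate on screen. src_max and dst_max
   are the image and screen dimensions along this axis. Returns 1 if
   something remains to be drawn, 0 otherwise. */
int clip_axis(int *src, int *len, int *dst, int src_max, int dst_max)
{
    /* Clip the subrectangle to the image */
    if(*src < 0) { *len += *src; *dst -= *src; *src = 0; }
    if(*src + *len > src_max) *len = src_max - *src;

    /* Clip the result to the screen */
    if(*dst < 0) { *src -= *dst; *len += *dst; *dst = 0; }
    if(*dst + *len > dst_max) *len = dst_max - *dst;

    return *len > 0;
}
```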

After adjusting (or not) coordinates, both of these functions fall to the next level. Rectangle masks are computed to indicate which parts of the VRAM may be affected and which must be preserved. (This is because everything will be manipulated with longwords from now on, and rendering boundaries will fall in the middle of them.)

Since the masks prevent us from painting outside of the target area, we can now relax our source rectangle. Instead of pixels, we can now consider full 32-bit positions. We'll render each of them on the VRAM using the color profile function then move on until the image is complete.
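
A possible way to compute the rectangle masks for one row is sketched below. The function name and signature are hypothetical. The sketch assumes a 128-pixel wide screen, hence four longwords per row, and that the leftmost pixel of a longword is its most significant bit.

```c
#include <stdint.h>

/* Fills masks[0..3] with a 1-bit for every pixel column between l and r
   (inclusive, with 0 <= l <= r < 128) and a 0-bit everywhere else */
void rect_masks(int l, int r, uint32_t masks[4])
{
    for(int i = 0; i < 4; i++)
    {
        int first = 32 * i, last = first + 31;

        /* Longword entirely outside of the rendered area */
        if(r < first || l > last) { masks[i] = 0; continue; }

        /* Covered range within this longword, counted from the left */
        int lo = (l > first) ? l - first : 0;
        int hi = (r < last) ? r - first : 31;

        masks[i] = (0xffffffffu >> lo) & (0xffffffffu << (31 - hi));
    }
}
```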

Two functions are used for this task:

  • bopti_render() does the prep work and parameter computation.
  • bopti_grid() iterates over positions and calls the profile's renderer.

The last level is the profile renderer, which is implemented in assembler. These functions take as parameters the current VRAM values, a pointer to image data, a pointer to rectangle masks, and the x-position of the blit, and return the new VRAM values.

Note that a single position will generally intersect two VRAM longwords because the x-coordinate supplied by the user can be arbitrary. A fair amount of shifting is involved to position the position (hence the name) along the proper x-coordinate, then render. Rectangle masks are aligned on the same x-coordinate as the VRAM so we don't have to shift them. In general, this will look like this:

 <--- Preserved area ---><----------- Rendered area ----------->

+----------- VRAM 1 ------------+----------- VRAM 2 ------------+
| ####################### # # # | # # # # # # # # # ########### |
+-------------------------------+-------------------------------+
                         |                          |
+----------- Mask 1 ------------+----------- Mask 2 ------------+
|                        ###### | ############################# |
+-------------------------------+-------------------------------+
                    |                               |
                    +---------- Position -----------+
<---- x offset ---->| # # # # # # # # # # # # # # # |
                    +-------------------------------+

There are two types of such functions:

  • bopti_asm_* for the mono and mono_alpha profiles, on a single VRAM (but still with two VRAM longwords because of positioning).
  • bopti_gasm_* for all four profiles, on two VRAMs (for a total of four VRAM longwords).
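
The shifting described above can be sketched in C for a single black layer on one VRAM. The real renderers are written in assembler and handle all the layers of a profile; the function below is hypothetical and takes x as the offset of the position within the first VRAM longword, between 0 and 31.

```c
#include <stdint.h>

/* Renders one position of a black layer. vram[0] and vram[1] are the two
   consecutive VRAM longwords that the position intersects, and masks[0]
   and masks[1] are the matching rectangle masks. */
void render_black(uint32_t vram[2], uint32_t data, const uint32_t masks[2],
    int x)
{
    /* Part of the position that lands in each longword. Shifting a
       32-bit value by 32 is undefined in C, hence the special case. */
    uint32_t left  = data >> x;
    uint32_t right = x ? data << (32 - x) : 0;

    /* 0-bits of a black layer are neutral, so masking the data is
       enough to preserve the pixels outside of the rendered area */
    vram[0] |= left & masks[0];
    vram[1] |= right & masks[1];
}
```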