Go to file

Lephenixnoir 849121e0b0 meta: initial thoughts		2023-07-25 00:47:34 +02:00
README.md	meta: initial thoughts	2023-07-25 00:47:34 +02:00

README.md

Concept for an fx-9860G emulator for fx-CG 50

Goal: an fx-CG add-in (.g3a) that can run fx-9860G add-ins (.g1a).

⚠️ WARNING: This is a feasability study/thought experiment. I do not guarantee that there'll ever be a working result.

Vocabulary:

“host”: the fx-CG serving as an emulator.
“target“: the g1a file being emulated.
“target source”: original code from the g1a file.
“target code“: recompiled basic blocks from the target source.

Overview

In order to emulate a g1a, we need to emulate the resources that it uses.

Resource	Proposed design
CPU	Basic block level recompilation (QEMU-style)
Memory	Dynamic remap in target code with widget (bite the bullet)
MPU	Intercepted memory accesses; case-by-case owned or shared
Syscalls	Emulated or translated, partial support
Interrupts	Redirected to the emulator context

CPU

Recompilation at the basic block level.

Main loop

The main loop of the emulator runs individual blocks and gets back jump targets.

uint32_t pc = 0x00300200;
while(true) {
    basic_block_t *bb = find_basic_block(pc);
    if(!bb)
        bb = compile_basic_block(g1a, pc);
    uint32_t target = run_basic_block(bb);
    pc = update_pc(pc, target);
}

This pseudo-function must be written in assembler in order to most efficiently flow from one block to the next, and avoid context switches at all costs. Both finding and jumping to basic blocks should be done without saving the entire CPU context. We only go back to C code if a new basic block has to be compiled.

If context switches turn out to be too expensive and/or stuttery, the emulator can eagerly precompile statically-reachable sections of the target source by performing flow analysis / CFG reconstructions (compiler.precompile). This should catch most code, leaving only function pointers and jump tables behind.

Target basic blocks

Basic blocks start at whichever PC caused their access, and stop when encountering one of the following instructions:

Unconditional jumps: bra, braf, jmp
Function calls: bsr, bsrf, jsr
Returns: rte, rts

They do not, maybe counter-intuitively, stop at the following fall-through instructions:

Conditional jumps: bf, bf.s, bt, bt.s

This is because it is likely faster to fall through without going through the emulator's main loop by using the following type of widget.

    nop1
    bf      bb_pc-12
    nop2
    bra     bb_pc+28
~~>
    nop1
    bf      1f
    nop2
    rts
    mov     #28, <scrap>
    #---
1:  rts
    mov     #-12, <scrap>

Target code blocks are functions returning either offsets or new addresses. There doesn't appear to be one best format; offsets are way easier to program when recompiling, but absolute addresses must be supported, eg. for jmp.

Suggested approach for the semantics of the return value:

any even number is an absolute PC;
an odd number is the distance from the PC of the block + 1;
rts is handled by returning captured pr;
rte is handled by a widget.

Reasoning: assuming no completely bugged branches we can exploit the odd bit. PC-relative branches almost always use constant distances, so there are many more opportunities to factor the +1 in the constant compared to absolute jumps, for which a register is typically used.

This can be decoded pretty efficiently with few registers.

# Input: r13 is the value being tested
# Context saving required: pc value, T bit
# Fastest path for relative branches

    shar    <retval>
    bf      1f
    add     <retval>, <pc>
    bra     .next_iteration
    add     <retval>, <pc>
1f: bra     .next_iteration
    mov     <retval>, <pc>

Note that in the case of jump tables like Duff's device, blocks can overlap. This is annoying to deal with because the compiler is not expected to keep track of where in each target code block every source address lands. As a first approximation, this can be ignored completely, with each sub-block being recompiled independently. This just creates overlapping blocks, which is functionally not a problem.

It would still be beneficial to track whether such overlapping blocks exist (tracking.block_overlaps), mainly because that'd be necessary to correctly invalidate them after a write to code areas in self-modifying code. I'd like to say this is way off, but gint uses self-modifying code for interrupt-cancellable sleeps. Speaking of which, we should track that too (tracking.self_modifying_code).

TODO: Reordering delay slots with final steps

Captured registers

Target code must have access to registers for temporary data, function calls to runtime support from the emulator, and other checks. Instead of dynamically reallocating registers like QEMU's TCG does, the emulator uses fixed registers and rewrites accesses to these registers.

The following registers are used in target code for a purpose different than they do in target source:

r12 is scrap register 1.
r13 is scrap register 2.
r14 is the meta register.
pr is overwritten all the time by emulator logic.
vbr is replaced so interrupts can be caught and scheduled.

The meta register is a pointer to a storage area where the emulator keeps commonly-used variables.

Using gbr as meta register was considered due to its flexible addressing capabilities, however instructions that use gbr cannot easily be translated into instructions that use r12 or r13 specifically because of gbr's flexible addressing capabilities.

Note that there is no OS vbr since we emulate syscalls; we don't use OS code at all. We couldn't carry a copy of the fx-9860G OS anyway.

Accesses to captured registers go through the meta storage as they are loaded temporarily into a scrap register.

    add     #1, r14
~~>
    mov.l   @(<offset>, <meta>), <scrap>
    add     #1, <scrap>
    mov.l   <scrap>, @(<offset>, <meta>)

The minimum number of scrap registers required is the maximum number of register operands to a single instruction, which is 2. This maximum is reached with instructions whose operands are all captured registers.

Meta storage

Captured registers:
- r12, r13 (scrap registers 1 and 2)
- r14 (meta register)
- pr (overwritten by emulator internals)
- vbr (overwritten to capture interrupts)
- spc and possibly ssr
Data for context switches (all callee-saved registers that are not already captured: r0 through r7, mach, macl)

Probably more (TODO)

Be careful: offsets for mov.l (<offset>, rm), rn are limited to 64 bytes! So far only r12, r13, r14, pr are so important as to requiring a single-instruction. The rest is used rarely enough or even only during context switches.

Meta stack

Target code widgets will probably need to perform computations using temporary registers beyond the scrap registers, or use the T bit for comparisons, which is global state stored in sr.

(Note: save T with movt rX, restore it with tst rX, rX or shlr rX.)

Any register saving for the purposes of such computations cannot occur on the user stack (r15) because interrupts might occur in the middle of the block (unless the idea of scheduling them between blocks works, TODO).

In this case, it might be possible to use the space above the meta storage as a stack (the “meta stack”), using the fact that accesses to the meta storage using constant indices can simply be updated to reflect pointer movement since data usage on the meta stack would be statically known when recompiling.

    mov.l   r0, @-<meta>
    mov.l   @(12, <meta>), <scrap>      # access to meta+8 shifted
    # ...
    mov.l   @<meta>+, r0

Would there be access to non-constant offsets in meta storage? (TODO)

Context switches

To context switch between target code and internal emulator code, all registers not preserved by C calling conventions must be saved. Hopefully that's pretty much it.

Widgets and tools for writing widgets

Main widgets:

Memory access
Load/store captured registers
Load block-return-value constants

Basic block storage

4-level 32/256/256/128-entry radix tree is tempting. But that's one (hopefully 1-byte entry) per every 2 bytes in the program, which is a lot.

More specifically, the first 3 levels are page-table-style normal radix trees. The last level is a structure

struct {
    uint8_t offsets[128];
    basic_block_t *bbs[];
};

where every PC value is associated a single-byte entry number within the bbs array. It's easy to keep track of the highest entry number to add more as needed, however the storage space for bbs is likely going to be a problem. With eager precompilation it could work because we can leave space just in case but otherwise know in advance how much space is needed.

Variation: the single-byte entry number gets added to a base entry number (a multiple of 256) to access a 65536-entry basic block pools. The only constraint is that each leaf table (covering 256 bytes/128 instructions) always has to fit within a single 128-segment in the pool. Technically not hard, but if we always add to the end all the 128-segments will get filled...

Overview of instructions

Instructions with specific emulation needs (widgets or adjustments):

bf, bf.s, bt, bt.s: Widgets, with extra code at end of block
bra, braf, bsr, bsfr, jmp, jsr, rte, rts: Block terminators
ldc.l, lds.l, stc.l, sts.l: Memory widget
mov (PC-relative): New offsets must be computed
mov (register-relative, GBR-relative), movua.l: Memory widget
and.b, or.b, tst.b, xor.b: Memory widget + captured GBR
tas.b: Atomicity requirement!

Instructions with no semantic changes beyond access to captured registers:

add, addc, addv, and, cmp/*, div0u, dmuls.l, dmuls.u, dt, exts, extu, mac.l, mac.w, movt, mul.l, muls.w, mulu.w, neg, negc, or, rotcl, rotcr, rotl, rotr, shad, shal, shar, shld, shll*, shlr*, sub, subc, subv, swap.b, swap.w, tst, xor, xtrct: Arithmetic
mov (MT)
ldc, lds, stc, sts

Transparent instructions:

clrmac, clrs, clrt, div0s, div0u, nop, sets, sett, sleep, synco, trapa

Others:

icbi, ocbi, ocbp, ocbwbe: Unclear what to do with these (TODO how to fulfill expectations of cache/DMA coherence code?)
pref, prefi: Same thing as above
ldtlb: Hopefully unused, otherwise dynamic mapping (TODO)
movca.l, movco.l, movli.l: Think about these later (TODO)
DSP instructions (TODO)

Memory

In general, memory needs to be dynamically re-mapped at every access. Replacing pointers in the target code context breaks pointer arithmetic unless in very controlled cases of correct object access, which is a possible optimization but definitely not a safe one for any program.

Memory map

Region	Remapped to
`0x00300000`	Code: dynamic recompiler table. Data: g1a loaded to RAM.
`0x08100000`	Fixed buffer in host memory.
On-chip memory	Itself.
P1/P2 URAM	Emulate MMU for gint's auto-detection.
MPU registers	Intercepted manually.
OS land	Refused/crash.

TODO.

Basic idea: 256-entry table based on the high 8 bits of the address.

0x00 : P0 Code
0x08 : P0 User RAM
0x8* : P1 ROM
0x88 : P1 RAM
0xa* : P2 ROM
0xa8 : P2 RAM
0xe5 : ILRAM, XRAM, or YRAM
0xfd : RSRAM (or ECC registers)
Other : MPU registers

MPU

TODO

Basic idea: some shared, some emulated.

CPG: Shared.
DMA: Shared unless a region is remapped to somewhere else that doesn't support the DMA in the same way. Instant DMA is always a sensible default.
INTC: Shared; the emulator shouldn't need interrupts, except maybe the UBC.
KEYSC: Shared.
MMU: Emulated.
RTC: Shared.
T6K11: Emulated.
TMU/ETMU: Shared.
USB: Shared but I'd be very surprised if it ever worked.

Syscalls

TODO (emulate common ones, crash on others)

Interrupts

TODO (schedule them at the end of blocks? because issues: breaking atomic instructions and interrupting the recompiler. will that work with stuff like sleep and other interrput-aware code or busy waiting loops?)

Summary and other remarks

Automatic optimization hints

There are many recompilation or emulation optimizations that can be performed if the target happens to behave well in some regard. For example, if all memory access using r15 are in the stack region, if the code segment is never accessed by a data access, etc.

The emulator can monitor properties of interest for optimization (with an option for the ones that slow down recompiled code) and after execution deliver a hint about good behaviors that appear to be followed and can lead to optimizations (tracking.optimizations).

Automatic detection of targeted calculator model

I'm pretty sure it's possible to start running in dual SH3/SH4 mode and later specialize into either depending on what the program does.

Conventions for the display can also probably be detected so that programs for both the G-II and G-III series can be executed transparently.

Dataflow analysis for captured registers

If a single captured register is used over a section of code, it is more efficient to not do a load/store cycle at each use.

    mov     r8, r13
    add     #4, r13
~~>
    mov.l   @(<offset>, <meta>), <scrap>
    mov     r8, <scrap>
    mov.l   <scrap>, @(<offset>, <meta>)
    mov.l   @(<offset>, <meta>), <scrap>
    add     #4, <scrap>
    mov.l   <scrap>, @(<offset>, <meta>)
~~>
    mov.l   @(<offset>, <meta>), <scrap>
    mov     r8, <scrap>
    add     #4, <scrap>
    mov.l   <scrap>, @(<offset>, <meta>)

Summary of options

tracking.optimizations: Enables target code widgets that monitor optimization opportunities
tracking.block_overlaps: Tracks recompiled blocks that overlap (typically from jump tables)
tracking.self_modifying_code: Tracks whether self-modifying code is used
compiler.precompile: Eagerly compile blocks reachable from main() before starting execution

Other options not yet named in the document (TODO):

Put the stack in on-chip memory if on-chip memory isn't used by the program
Do not use the memory widget for simple acceses to r15