Lephenixnoir 849121e0b0 | ||
---|---|---|
README.md |
README.md
Concept for an fx-9860G emulator for fx-CG 50
Goal: an fx-CG add-in (.g3a) that can run fx-9860G add-ins (.g1a).
⚠️ WARNING: This is a feasability study/thought experiment. I do not guarantee that there'll ever be a working result.
Vocabulary:
- “host”: the fx-CG serving as an emulator.
- “target“: the g1a file being emulated.
- “target source”: original code from the g1a file.
- “target code“: recompiled basic blocks from the target source.
Overview
In order to emulate a g1a, we need to emulate the resources that it uses.
Resource | Proposed design |
---|---|
CPU | Basic block level recompilation (QEMU-style) |
Memory | Dynamic remap in target code with widget (bite the bullet) |
MPU | Intercepted memory accesses; case-by-case owned or shared |
Syscalls | Emulated or translated, partial support |
Interrupts | Redirected to the emulator context |
CPU
Recompilation at the basic block level.
Main loop
The main loop of the emulator runs individual blocks and gets back jump targets.
uint32_t pc = 0x00300200;
while(true) {
basic_block_t *bb = find_basic_block(pc);
if(!bb)
bb = compile_basic_block(g1a, pc);
uint32_t target = run_basic_block(bb);
pc = update_pc(pc, target);
}
This pseudo-function must be written in assembler in order to most efficiently flow from one block to the next, and avoid context switches at all costs. Both finding and jumping to basic blocks should be done without saving the entire CPU context. We only go back to C code if a new basic block has to be compiled.
If context switches turn out to be too expensive and/or stuttery, the emulator
can eagerly precompile statically-reachable sections of the target source by
performing flow analysis / CFG reconstructions (compiler.precompile
). This
should catch most code, leaving only function pointers and jump tables behind.
Target basic blocks
Basic blocks start at whichever PC caused their access, and stop when encountering one of the following instructions:
- Unconditional jumps:
bra
,braf
,jmp
- Function calls:
bsr
,bsrf
,jsr
- Returns:
rte
,rts
They do not, maybe counter-intuitively, stop at the following fall-through instructions:
- Conditional jumps:
bf
,bf.s
,bt
,bt.s
This is because it is likely faster to fall through without going through the emulator's main loop by using the following type of widget.
nop1
bf bb_pc-12
nop2
bra bb_pc+28
~~>
nop1
bf 1f
nop2
rts
mov #28, <scrap>
#---
1: rts
mov #-12, <scrap>
Target code blocks are functions returning either offsets or new addresses.
There doesn't appear to be one best format; offsets are way easier to program
when recompiling, but absolute addresses must be supported, eg. for jmp
.
Suggested approach for the semantics of the return value:
- any even number is an absolute PC;
- an odd number is the distance from the PC of the block + 1;
rts
is handled by returning capturedpr
;rte
is handled by a widget.
Reasoning: assuming no completely bugged branches we can exploit the odd bit. PC-relative branches almost always use constant distances, so there are many more opportunities to factor the +1 in the constant compared to absolute jumps, for which a register is typically used.
This can be decoded pretty efficiently with few registers.
# Input: r13 is the value being tested
# Context saving required: pc value, T bit
# Fastest path for relative branches
shar <retval>
bf 1f
add <retval>, <pc>
bra .next_iteration
add <retval>, <pc>
1f: bra .next_iteration
mov <retval>, <pc>
Note that in the case of jump tables like Duff's device, blocks can overlap. This is annoying to deal with because the compiler is not expected to keep track of where in each target code block every source address lands. As a first approximation, this can be ignored completely, with each sub-block being recompiled independently. This just creates overlapping blocks, which is functionally not a problem.
It would still be beneficial to track whether such overlapping blocks exist
(tracking.block_overlaps
), mainly because that'd be necessary to correctly
invalidate them after a write to code areas in self-modifying code. I'd like to
say this is way off, but gint uses self-modifying code for
interrupt-cancellable sleeps. Speaking of which, we should track that too
(tracking.self_modifying_code
).
TODO: Reordering delay slots with final steps
Captured registers
Target code must have access to registers for temporary data, function calls to runtime support from the emulator, and other checks. Instead of dynamically reallocating registers like QEMU's TCG does, the emulator uses fixed registers and rewrites accesses to these registers.
The following registers are used in target code for a purpose different than they do in target source:
r12
is scrap register 1.r13
is scrap register 2.r14
is the meta register.pr
is overwritten all the time by emulator logic.vbr
is replaced so interrupts can be caught and scheduled.
The meta register is a pointer to a storage area where the emulator keeps commonly-used variables.
Using gbr
as meta register was considered due to its flexible addressing
capabilities, however instructions that use gbr
cannot easily be translated
into instructions that use r12
or r13
specifically because of gbr
's
flexible addressing capabilities.
Note that there is no OS vbr
since we emulate syscalls; we don't use OS code
at all. We couldn't carry a copy of the fx-9860G OS anyway.
Accesses to captured registers go through the meta storage as they are loaded temporarily into a scrap register.
add #1, r14
~~>
mov.l @(<offset>, <meta>), <scrap>
add #1, <scrap>
mov.l <scrap>, @(<offset>, <meta>)
The minimum number of scrap registers required is the maximum number of register operands to a single instruction, which is 2. This maximum is reached with instructions whose operands are all captured registers.
Meta storage
- Captured registers:
r12
,r13
(scrap registers 1 and 2)r14
(meta register)pr
(overwritten by emulator internals)vbr
(overwritten to capture interrupts)spc
and possiblyssr
- Data for context switches (all callee-saved registers that are not already
captured:
r0
throughr7
,mach
,macl
)
Probably more (TODO)
Be careful: offsets for mov.l (<offset>, rm), rn
are limited to 64 bytes!
So far only r12
, r13
, r14
, pr
are so important as to requiring a
single-instruction. The rest is used rarely enough or even only during context
switches.
Meta stack
Target code widgets will probably need to perform computations using temporary
registers beyond the scrap registers, or use the T bit for comparisons, which
is global state stored in sr
.
(Note: save T with movt rX
, restore it with tst rX, rX
or shlr rX
.)
Any register saving for the purposes of such computations cannot occur on the
user stack (r15
) because interrupts might occur in the middle of the block
(unless the idea of scheduling them between blocks works, TODO).
In this case, it might be possible to use the space above the meta storage as a stack (the “meta stack”), using the fact that accesses to the meta storage using constant indices can simply be updated to reflect pointer movement since data usage on the meta stack would be statically known when recompiling.
mov.l r0, @-<meta>
mov.l @(12, <meta>), <scrap> # access to meta+8 shifted
# ...
mov.l @<meta>+, r0
Would there be access to non-constant offsets in meta storage? (TODO)
Context switches
To context switch between target code and internal emulator code, all registers not preserved by C calling conventions must be saved. Hopefully that's pretty much it.
Widgets and tools for writing widgets
Main widgets:
- Memory access
- Load/store captured registers
- Load block-return-value constants
Basic block storage
4-level 32/256/256/128-entry radix tree is tempting. But that's one (hopefully 1-byte entry) per every 2 bytes in the program, which is a lot.
More specifically, the first 3 levels are page-table-style normal radix trees. The last level is a structure
struct {
uint8_t offsets[128];
basic_block_t *bbs[];
};
where every PC value is associated a single-byte entry number within the bbs
array. It's easy to keep track of the highest entry number to add more as
needed, however the storage space for bbs
is likely going to be a problem.
With eager precompilation it could work because we can leave space just in case
but otherwise know in advance how much space is needed.
Variation: the single-byte entry number gets added to a base entry number (a multiple of 256) to access a 65536-entry basic block pools. The only constraint is that each leaf table (covering 256 bytes/128 instructions) always has to fit within a single 128-segment in the pool. Technically not hard, but if we always add to the end all the 128-segments will get filled...
Overview of instructions
Instructions with specific emulation needs (widgets or adjustments):
bf
,bf.s
,bt
,bt.s
: Widgets, with extra code at end of blockbra
,braf
,bsr
,bsfr
,jmp
,jsr
,rte
,rts
: Block terminatorsldc.l
,lds.l
,stc.l
,sts.l
: Memory widgetmov
(PC-relative): New offsets must be computedmov
(register-relative, GBR-relative),movua.l
: Memory widgetand.b
,or.b
,tst.b
,xor.b
: Memory widget + captured GBRtas.b
: Atomicity requirement!
Instructions with no semantic changes beyond access to captured registers:
add
,addc
,addv
,and
,cmp/*
,div0u
,dmuls.l
,dmuls.u
,dt
,exts
,extu
,mac.l
,mac.w
,movt
,mul.l
,muls.w
,mulu.w
,neg
,negc
,or
,rotcl
,rotcr
,rotl
,rotr
,shad
,shal
,shar
,shld
,shll*
,shlr*
,sub
,subc
,subv
,swap.b
,swap.w
,tst
,xor
,xtrct
: Arithmeticmov
(MT)ldc
,lds
,stc
,sts
Transparent instructions:
clrmac
,clrs
,clrt
,div0s
,div0u
,nop
,sets
,sett
,sleep
,synco
,trapa
Others:
icbi
,ocbi
,ocbp
,ocbwbe
: Unclear what to do with these (TODO how to fulfill expectations of cache/DMA coherence code?)pref
,prefi
: Same thing as aboveldtlb
: Hopefully unused, otherwise dynamic mapping (TODO)movca.l
,movco.l
,movli.l
: Think about these later (TODO)- DSP instructions (TODO)
Memory
In general, memory needs to be dynamically re-mapped at every access. Replacing pointers in the target code context breaks pointer arithmetic unless in very controlled cases of correct object access, which is a possible optimization but definitely not a safe one for any program.
Memory map
Region | Remapped to |
---|---|
0x00300000 |
Code: dynamic recompiler table. Data: g1a loaded to RAM. |
0x08100000 |
Fixed buffer in host memory. |
On-chip memory | Itself. |
P1/P2 URAM | Emulate MMU for gint's auto-detection. |
MPU registers | Intercepted manually. |
OS land | Refused/crash. |
Memory widget
TODO.
Basic idea: 256-entry table based on the high 8 bits of the address.
0x00
: P0 Code0x08
: P0 User RAM0x8*
: P1 ROM0x88
: P1 RAM0xa*
: P2 ROM0xa8
: P2 RAM0xe5
: ILRAM, XRAM, or YRAM0xfd
: RSRAM (or ECC registers)- Other : MPU registers
MPU
TODO
Basic idea: some shared, some emulated.
- CPG: Shared.
- DMA: Shared unless a region is remapped to somewhere else that doesn't support the DMA in the same way. Instant DMA is always a sensible default.
- INTC: Shared; the emulator shouldn't need interrupts, except maybe the UBC.
- KEYSC: Shared.
- MMU: Emulated.
- RTC: Shared.
- T6K11: Emulated.
- TMU/ETMU: Shared.
- USB: Shared but I'd be very surprised if it ever worked.
Syscalls
TODO (emulate common ones, crash on others)
Interrupts
TODO (schedule them at the end of blocks? because issues: breaking atomic instructions and interrupting the recompiler. will that work with stuff like sleep and other interrput-aware code or busy waiting loops?)
Summary and other remarks
Automatic optimization hints
There are many recompilation or emulation optimizations that can be performed if the target happens to behave well in some regard. For example, if all memory access using r15 are in the stack region, if the code segment is never accessed by a data access, etc.
The emulator can monitor properties of interest for optimization (with an
option for the ones that slow down recompiled code) and after execution deliver
a hint about good behaviors that appear to be followed and can lead to
optimizations (tracking.optimizations
).
Automatic detection of targeted calculator model
I'm pretty sure it's possible to start running in dual SH3/SH4 mode and later specialize into either depending on what the program does.
Conventions for the display can also probably be detected so that programs for both the G-II and G-III series can be executed transparently.
Dataflow analysis for captured registers
If a single captured register is used over a section of code, it is more efficient to not do a load/store cycle at each use.
mov r8, r13
add #4, r13
~~>
mov.l @(<offset>, <meta>), <scrap>
mov r8, <scrap>
mov.l <scrap>, @(<offset>, <meta>)
mov.l @(<offset>, <meta>), <scrap>
add #4, <scrap>
mov.l <scrap>, @(<offset>, <meta>)
~~>
mov.l @(<offset>, <meta>), <scrap>
mov r8, <scrap>
add #4, <scrap>
mov.l <scrap>, @(<offset>, <meta>)
Summary of options
tracking.optimizations
: Enables target code widgets that monitor optimization opportunitiestracking.block_overlaps
: Tracks recompiled blocks that overlap (typically from jump tables)tracking.self_modifying_code
: Tracks whether self-modifying code is usedcompiler.precompile
: Eagerly compile blocks reachable frommain()
before starting execution
Other options not yet named in the document (TODO):
- Put the stack in on-chip memory if on-chip memory isn't used by the program
- Do not use the memory widget for simple acceses to r15