## Functions

Probably the most common object of interest is code. In fxos, the “proper” way to deal with code is through functions. Random instructions not tied to functions have much less support in terms of tooling and analysis.

This document describes structures defined in

```cpp
#include
```

### Navigating functions, basic blocks and instructions

Functions in fxos are stored as [Control Flow Graphs](https://en.wikipedia.org/wiki/Control-flow_graph) (CFG), a format that lends itself well to analysis. The function itself is split into _basic blocks_, each consisting of a straight series of instructions terminated by an explicit or implicit jump to one or two other blocks. In essence, a basic block is the largest unit of sequential code that you can find in a function.

#### The `Function` structure

The `Function` structure is a `BinaryObject`, so it always lives in a binary that can be found with `.parentBinary()`. It also has the usual `.address()`, `.size()`, and name/comments.

When iterated with `.begin()`/`.end()`, it produces references to its basic blocks in an arbitrary order:

```cpp
for(BasicBlock &bb: function) {
    /* ... */
}
```

Blocks are numbered from 0 to `.blockCount() - 1` and can be accessed individually with `.basicBlockByIndex()` or, when you know the address, `.basicBlockByAddress()`. The function's entry block index and the block itself can be found with `.entryBlockIndex()` and `.entryBlock()`.

#### The `BasicBlock` structure

The `BasicBlock` structure represents a node in the CFG. It always exists in the context of a function, which can be found with `.parentFunction()`. The binary that owns the function is also available as `.parentBinary()`.

The block has its own `.address()`, which is the address of its first instruction, and it knows its own block number within the parent function, which is available as `.blockIndex()`.
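
To make the notion of a basic block concrete, the sketch below applies the textbook "leader" rule to a toy program: the first instruction, every jump target, and every instruction that follows a jump or return starts a new block. `Kind`, `ToyInsn` and `splitIntoBlocks` are hypothetical names invented for this illustration, not fxos types, and the toy deliberately ignores delay slots. How fxos actually reconstructs its blocks is not described here.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <utility>
#include <vector>

// Hypothetical toy instruction kinds; these are NOT fxos types.
enum class Kind { Plain, CondJump, Jump, Return };

struct ToyInsn {
    Kind kind;
    std::size_t target; // index of the jump target (CondJump/Jump only)
};

// Split a toy program into basic blocks, returned as [start, end) ranges
// of instruction indices. Delay slots are ignored in this sketch.
std::vector<std::pair<std::size_t, std::size_t>>
splitIntoBlocks(std::vector<ToyInsn> const &prog)
{
    std::set<std::size_t> leaders;
    if(!prog.empty())
        leaders.insert(0);

    for(std::size_t i = 0; i < prog.size(); i++) {
        Kind k = prog[i].kind;
        if(k == Kind::Plain)
            continue;
        // Jump targets must be at the beginning of a block.
        if(k != Kind::Return) {
            assert(prog[i].target < prog.size());
            leaders.insert(prog[i].target);
        }
        // Whatever follows a terminator starts a new block too.
        if(i + 1 < prog.size())
            leaders.insert(i + 1);
    }

    // Each block runs from one leader up to the next one.
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    for(auto it = leaders.begin(); it != leaders.end(); ++it) {
        auto next = std::next(it);
        std::size_t end = (next == leaders.end()) ? prog.size() : *next;
        blocks.emplace_back(*it, end);
    }
    return blocks;
}
```

On a six-instruction if/else (compare, conditional jump to 4, `mov`, jump to 5, `mov`, return), this yields the four blocks [0,2), [2,4), [4,5) and [5,6).
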
##### Accessing instructions

Blocks have a `.instructionCount()` and the instructions can be iterated over in increasing address order with the view `.instructionsInAddressOrder()`:

```cpp
/* random_access_range */
for(Instruction &insn: bb.instructionsInAddressOrder()) {
    /* ... eg. mov #0, r0; rts; nop */
}
```

Individual instructions can also be found with `.instructionAtIndex()`.

A major subtlety of the SuperH ISA is that most branch and function call instructions have delay slots, meaning that the CPU fills in a pipeline bubble by executing the instruction following the branch/call while it's fetching the first few instructions at the branch target. **This makes jump-and-delay-slot pairs unique instructions** that are not equivalent to any two-instruction sequence.

The process of executing a jump-and-delay-slot pair is as follows:

1. Compute the branch target using the initial state.
2. Set PC to that target (and start fetching the code there).
3. Run the delay slot instruction.

This means that the delay slot instruction runs "before" the jump but does not affect the jump target. It also runs with the target PC, so PC-relative instructions don't behave in the natural way. (This is not much of a problem in practice because most are illegal as delay slots anyway and I don't think any compiler abuses this for optimization.)

Instructions that possess delay slots can be identified with `.opcode().hasDelaySlot()` (see `Instruction` below). Additionally, the basic block provides `.instructionsAndDelaySlots()`, which iterates over instructions as pairs. Instructions without a delay slot are returned as `{ins, nullptr}`, while instructions with delay slots are returned as `{ins, &delaySlotIns}`. As a result, pairs are returned in execution order; the user simply has to handle `delaySlotIns` before `ins` when the former is not null.
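
The three execution steps listed above can be modelled with a toy CPU. This is a minimal sketch, not fxos code: `ToyCpu` and both functions are hypothetical names, and the immediate is stored as-is instead of being sign-extended the way a real `mov #imm` does.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical toy CPU state; this is NOT an fxos type.
struct ToyCpu {
    std::array<std::uint32_t, 16> r{};
    std::uint32_t pc = 0;
};

// Execute the pair {jmp @rn ; mov #imm, rm}, with the mov in the delay slot.
void runJmpWithDelaySlot(ToyCpu &cpu, int n, int m, std::uint32_t imm)
{
    assert(n >= 0 && n < 16 && m >= 0 && m < 16);
    // 1. Compute the branch target using the initial state.
    std::uint32_t target = cpu.r[n];
    // 2. Set PC to that target.
    cpu.pc = target;
    // 3. Run the delay slot instruction; it can no longer change the target.
    cpu.r[m] = imm;
}

// For contrast: the plain sequence "mov #imm, rm" followed by a jump
// through rn, modelled here without any delay slot.
void runMovThenJmp(ToyCpu &cpu, int n, int m, std::uint32_t imm)
{
    assert(n >= 0 && n < 16 && m >= 0 && m < 16);
    cpu.r[m] = imm;
    cpu.pc = cpu.r[n];
}
```

Starting from `r0 = 0x1000`, the delay-slot pair `jmp @r0` / `mov #4, r0` ends with PC at `0x1000` and `r0 = 4`, whereas the plain sequence ends with PC at `4`. This is why the pair cannot be rewritten as two independent instructions.
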
```cpp
/* input_range */
for(auto [ins, delaySlotIns]: bb.instructionsAndDelaySlots()) {
    // ins          : Instruction &
    // delaySlotIns : Instruction *
    /* ... eg. {(mov #0, r0), nullptr}; {rts, nop} */
}
```

TODO: Other iteration methods

##### Block endings

Basic blocks can end either explicitly because of a general jump/branch/return instruction, or implicitly by falling through to the next block. As an illustration of both, consider the following CFG for an if/else statement:

```
        ╒═══════════════╕
.entry: │ cmp/eq r4, r5 │      true
        │ bf .false     │────────────────╮
        ╘═══════════════╛        ╒═══════════════╕
          false │         .true: │ mov #4, r0    │
                ↓                │ bra .end      │
        ╒═══════════════╕        ╘═══════════════╛
.false: │ mov #7, r0    │                │
        ╘═══════════════╛                │
    fallthrough ↓                        │
        ╒═══════════════╕                │
.end:   │ rts           │←───────────────╯
        │ nop           │
        ╘═══════════════╛
```

The `.entry` block is terminated by a `bf` and the `.true` block is terminated by a `bra`, so both end explicitly. By contrast, the `.false` block doesn't have a terminator; when executed, it merely flows into (“falls through” to) `.end`. It is not possible to merge `.false` and `.end` into a single block because the jump at the end of `.true` jumps to `.end`, and in a CFG jump targets must be at the beginning of blocks.

Note that `.entry` might also fall through to `.true` when `r4 == r5` (the `bf` is only taken when the comparison fails), which shows that blocks with conditional terminators can end in multiple ways.

The function `.terminatorInstruction()` returns a pointer to the terminator of a block, or `nullptr` if there is none; the latter case is also indicated by the function `.hasNoTerminator()`.

When accounting for the nature of the terminator instructions, one of four things can happen after a basic block ends:

1. Static branch: branch to another block in the same function, whose address is statically known.
   - `.mayStaticBranch()` and `.mustStaticBranch()` indicate this ending
   - `.staticBranchTarget()` returns the address of the next block
2. Fallthrough to the next block.
   - `.mayFallthrough()` and `.mustFallthrough()` indicate this ending
   - `.fallthroughTarget()` returns the address of the next block
3. Return from the function.
   - `.mustReturn()` indicates this ending (there are no conditional returns)
4. Perform a tail call, ie. jump dynamically to somewhere in another function.
   - `.mustDynamicBranch()` indicates this ending (there are no conditional dynamic branches)
   - This also covers, as a default, things like jump tables where we don't know if we leave the function or not. This is due to imperfect function reconstruction.

Depending on the terminator instruction (or absence thereof), one or more endings are possible for a block. The table below lists all types of terminators and how they are identified in terms of `AsmInstruction`.

| Block ending         | Corresponding terminators  | Terminator identification | `.mayStaticBranch()` | `.mayFallthrough()` | `.mustReturn()` | `.mustDynamicBranch()` |
| -------------------- | -------------------------- | ------------------------- | -------------------- | ------------------- | --------------- | ---------------------- |
| Unconditional branch | `bra`                      | `.isUnconditionalJump()`  | `true`               | `false`             | `false`         | `false`                |
| Conditional branch   | `bf.s`, `bt.s`, `bf`, `bt` | `.isConditionalJump()`    | `true`               | `true`              | `false`         | `false`                |
| Tail call            | `jmp @rn`, `braf rm`       | `.isDynamicJump()`        | `false`              | `false`             | `false`         | `true`                 |
| Function return      | `rte`, `rts`               | `.isReturn()`             | `false`              | `false`             | `true`          | `false`                |
| Fallthrough          | No terminator              | No terminator             | `false`              | `true`              | `false`         | `false`                |

`.mustStaticBranch()` and `.mustFallthrough()` return true for a given basic block if the only may-function which returns true for that block is `.mayStaticBranch()` and `.mayFallthrough()`, respectively.

##### Accessing related blocks in the CFG

Navigation in the CFG mostly consists of querying the block's successors (ie.
which blocks this block can statically branch to) and its predecessors (which blocks statically branch to it). The number of successors and predecessors can be found with `.successorCount()` and `.predecessorCount()`, and the blocks themselves can be obtained with a few different views:

```cpp
/* Get blocks by reference: .successors(), .predecessors() */
for(BasicBlock &succ: bb.successors()) {
    /* ... */
}

/* Get their addresses: .successorsByAddress(), .predecessorsByAddress() */
for(u32 address: bb.successorsByAddress()) {
    /* ... */
}

/* Get their indices within the shared parent function:
   .successorsByIndex(), .predecessorsByIndex() */
for(uint index: bb.successorsByIndex()) {
    /* ... */
}
```

The methods `.isEntryBlock()` and `.isTerminator()` identify, respectively, the first block in the function and the blocks that return (the second is equivalent to `.mustReturn()`).

### The `Instruction` structure

The `Instruction` structure represents a single instruction, within the context of a function. The basic block, function and binary owning it can be queried with `.parentBlock()`, `.parentFunction()` and `.parentBinary()`.

This structure is instantiated in RAM for every single instruction registered as part of a function (on the order of several million for a standard OS binary), so it keeps a minimal number of attributes. In particular, analysis results are not stored here; they are stored at the binary/function level instead (as appropriate), using compact formats when needed.

The instruction's opcode can be accessed with `.opcode()`, and this can be used to check if the instruction is a branch, a memory access, what its operands are, etc. `Instruction` only tracks the context and analysis results.

Along with the opcode, `.size()` will give the instruction's size in bytes (which is usually 2 but can be 4 for DSP instructions).
The instruction has its own `.address()`, and its relationship to other instructions in its block can be found with `.indexInBlock()`, `.isFirstInBlock()`, `.isLastInBlock()` and `.isInDelaySlot()`. Note again that, because of delay slots, being last in the block and being a jump are not the same thing.

### Function analysis

TODO:

- Function prototypes
- References
- Cross-references
- Dominators and post-dominators

TODO: Abstract interpretation info