Functions

Probably the most common object of interest is code. In fxos, the “proper” way to deal with code is through functions. Random instructions not tied to functions have much less support in terms of tooling and analysis.

This document describes structures defined in

#include <fxos/function.h>

Navigating functions, basic blocks and instructions

Functions in fxos are stored as Control Flow Graphs (CFGs), a format that lends itself well to analysis. The function itself is split into basic blocks, each consisting of a straight series of instructions terminated by an explicit or implicit jump to one or two other blocks. In essence, a basic block is the largest unit of sequential code that you can find in a function.

The Function structure

The Function structure is a BinaryObject, so it always lives in a binary that can be found with .parentBinary(). It also has the usual .address(), .size(), and name/comments.

When iterated with .begin()/.end(), it produces references to its basic blocks in an arbitrary order:

for(BasicBlock &bb: function) {
    /* ... */
}

Blocks are numbered from 0 to .blockCount() - 1 and can be accessed individually with .basicBlockByIndex() or, when you know the address, .basicBlockByAddress(). The function's entry block index and the block itself can be found with .entryBlockIndex() and .entryBlock().

The BasicBlock structure

The BasicBlock structure represents a node in the CFG. It always exists in the context of a function, which can be found with .parentFunction(). The binary that owns the function is also available as .parentBinary().

The block has its own .address() which is the address of its first instruction, and it knows its own block number within the parent function, which is available as .blockIndex().

Accessing instructions

Blocks have a .instructionCount() and the instructions can be iterated over in increasing address order with the view .instructionsInAddressOrder():

/* random_access_range */
for(Instruction &insn: bb.instructionsInAddressOrder()) {
    /* ... eg. mov #0, r0; rts; nop */
}

Individual instructions can also be found with .instructionAtIndex().

A major subtlety of the SuperH ISA is that most branch and function call instructions have delay slots, meaning that the CPU fills in a pipeline bubble by executing the instruction following the branch/call while it's fetching the first few instructions at the branch target. This makes jump-and-delay-slot pairs unique instructions that are not equivalent to any two-instruction sequence.

The process of executing a jump-and-delay-slot pair is as follows:

  1. Compute the branch target using the initial state.
  2. Set PC to that target (and start fetching the code there).
  3. Run the delay slot instruction.

This means that the delay slot instruction runs "before" the jump but does not affect the jump target. It also runs with the target PC, so PC-relative instructions don't behave in the natural way. (This is not much of a problem in practice because most are illegal as delay slots anyway and I don't think any compiler abuses this for optimization.)
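The three steps above can be illustrated with a toy model. This is only a sketch and not fxos code: ToyCpu and executeJumpWithDelaySlot are hypothetical names, the model has nothing but registers and a PC, and the delay slot is represented as a callback rather than a decoded instruction.

```cpp
#include <array>
#include <cstdint>
#include <functional>

/* Hypothetical toy CPU state, for illustration only (not an fxos type). */
struct ToyCpu {
    std::array<std::uint32_t, 16> r {};
    std::uint32_t pc = 0;
};

/* Models a register-indirect jump (in the spirit of `jmp @rn`) together with
   its delay slot, following the three steps listed above. */
inline void executeJumpWithDelaySlot(ToyCpu &cpu, int n,
    const std::function<void(ToyCpu &)> &delaySlot)
{
    /* 1. Compute the branch target using the initial state. */
    std::uint32_t target = cpu.r[n];
    /* 2. Set PC to that target. */
    cpu.pc = target;
    /* 3. Run the delay slot; it can modify rn but no longer the target. */
    delaySlot(cpu);
}
```

With r0 = 0x8000 and a delay slot that adds 4 to r0, the model ends with PC = 0x8000 and r0 = 0x8004: the delay slot took effect, but the jump target was computed before it ran.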

Instructions that possess delay slots can be identified with .opcode().hasDelaySlot() (see Instruction below). Additionally, the basic block provides .instructionsAndDelaySlots() which iterates over instructions as pairs. Instructions without a delay slot are returned as {ins, nullptr}, while instructions with delay slots are returned as {ins, &delaySlotIns}. As a result, pairs are returned in execution order; the user simply has to handle delaySlotIns before ins when the former is not null.

/* input_range */
for(auto [ins, delaySlotIns]: bb.instructionsAndDelaySlots()) {
    // ins : Instruction &
    // delaySlotIns : Instruction *
    /* ... eg. {(mov #0, r0), nullptr}; {rts, nop} */
}
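As a rough sketch of "handle delaySlotIns before ins", the function below flattens a list of pairs into execution order. ToyInsn and ToyPair are hypothetical stand-ins for Instruction and for the pairs produced by .instructionsAndDelaySlots(); the stand-in uses a pointer for ins where the real view yields a reference.

```cpp
#include <string>
#include <vector>

/* Hypothetical stand-ins, for illustration only (not fxos types). */
struct ToyInsn {
    std::string text;
};
struct ToyPair {
    const ToyInsn *ins;           /* never null */
    const ToyInsn *delaySlotIns;  /* null when ins has no delay slot */
};

/* Returns the instruction texts in the order in which they take effect. */
std::vector<std::string> executionOrder(const std::vector<ToyPair> &pairs)
{
    std::vector<std::string> order;
    for(const ToyPair &p: pairs) {
        /* The delay slot takes effect before the jump itself */
        if(p.delaySlotIns)
            order.push_back(p.delaySlotIns->text);
        order.push_back(p.ins->text);
    }
    return order;
}
```

For the pairs {(mov #0, r0), nullptr} and {rts, nop} from the comment above, this yields mov #0, r0, then nop, then rts.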

TODO: Other iteration methods

Block endings

Basic blocks can end either explicitly because of a general jump/branch/return instruction, or implicitly by falling through to the next block. As an illustration of both, consider the following CFG for an if/else statement:

         ╒═══════════════╕
 .entry: │ cmp/eq r4, r5 │     true
         │ bf .false     │─────────────────╮
         ╘═══════════════╛         ╒═══════════════╕
           false │          .true: │ mov #4, r0    │
                 ↓                 │ bra .end      │
         ╒═══════════════╕         ╘═══════════════╛
 .false: │ mov #7, r0    │                │
         ╘═══════════════╛                │
     fallthrough ↓                        │
         ╒═══════════════╕                │
   .end: │ rts           │←───────────────╯
         │ nop           │
         ╘═══════════════╛

The .entry block is terminated by a bf and the .true block is terminated by a bra, so both end explicitly. By contrast, the .false block doesn't have a terminator; when executed, it merely flows into (“falls through” to) .end. It is not possible to merge .false and .end into a single block because the jump at the end of .true jumps to .end, and in a CFG jump targets must be at the beginning of blocks. Note that .entry might also fall through to .true when r4 == r5 (the bf is only taken when the comparison fails), which shows that blocks with conditional terminators can end in multiple ways.
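The rule that jump targets must sit at the beginning of blocks is what the textbook "leader" computation enforces. The sketch below applies it to a flat instruction list. It is an illustration, not necessarily how fxos builds its CFG: ToyInsn, its fields and blockStarts are hypothetical, and all static targets are assumed to lie inside the function.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

/* Hypothetical flat instruction record, for illustration only. */
struct ToyInsn {
    std::uint32_t address;
    bool endsBlock;                       /* jump, branch or return */
    bool hasDelaySlot;
    std::optional<std::uint32_t> target;  /* static target, if any */
};

/* Returns the addresses at which basic blocks must start. */
std::set<std::uint32_t> blockStarts(const std::vector<ToyInsn> &code)
{
    std::set<std::uint32_t> starts;
    if(code.empty())
        return starts;

    /* The first instruction always starts a block */
    starts.insert(code.front().address);

    for(std::size_t i = 0; i < code.size(); i++) {
        const ToyInsn &ins = code[i];
        if(!ins.endsBlock)
            continue;
        /* A static jump target starts a block */
        if(ins.target)
            starts.insert(*ins.target);
        /* So does whatever follows the terminator and its delay slot */
        std::size_t next = i + (ins.hasDelaySlot ? 2 : 1);
        if(next < code.size())
            starts.insert(code[next].address);
    }
    return starts;
}
```

Applied to the if/else example laid out at addresses 0 to 12 with 2-byte instructions (and, as in the diagram, only rts carrying a delay slot), this gives block starts 0, 4, 8 and 10, matching .entry, .true, .false and .end.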

The function .terminatorInstruction() returns a pointer to the terminator of a block, or nullptr if there is none; the latter case is also indicated by the function .hasNoTerminator().

When accounting for the nature of the terminator instructions, after a basic block ends one of four things can happen:

  1. Static branch: branch to another block in the same function, whose address is statically known.
    • .mayStaticBranch() and .mustStaticBranch() indicate this ending
    • .staticBranchTarget() returns the address of the next block
  2. Fallthrough to the next block.
    • .mayFallthrough() and .mustFallthrough() indicate this ending
    • .fallthroughTarget() returns the address of the next block
  3. Return from the function.
    • .mustReturn() indicates this ending (there are no conditional returns)
  4. Tail call: jump to a dynamically computed address, somewhere in another function.
    • .mustDynamicBranch() indicates this ending (there are no conditional dynamic branches)
    • This also covers, as a default, things like jump tables where we don't know if we leave the function or not. This is due to imperfect function reconstruction.

Depending on the terminator instruction (or absence thereof), one or more endings are possible for a block. The table below lists all types of terminators and how they are identified in terms of AsmInstruction.

| Block ending | Corresponding terminators | Terminator identification | .mayStaticBranch() | .mayFallthrough() | .mustReturn() | .mustDynamicBranch() |
|---|---|---|---|---|---|---|
| Unconditional branch | bra | .isUnconditionalJump() | true | false | false | false |
| Conditional branch | bf.s, bt.s, bf, bt | .isConditionalJump() | true | true | false | false |
| Tail call | jmp @rn, braf rm | .isDynamicJump() | false | false | false | true |
| Function return | rte, rts | .isReturn() | false | false | true | false |
| Fallthrough | No terminator | No terminator | false | true | false | false |

.mustStaticBranch() returns true for a given basic block if .mayStaticBranch() is the only may-function that returns true for that block; .mustFallthrough() is defined the same way with .mayFallthrough().
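The relationship between the may- and must-functions can be written down in a few lines. ToyEnding below is a hypothetical stand-in holding the two may-flags as plain booleans; it only mirrors the rule stated above and says nothing about how BasicBlock actually computes these values.

```cpp
/* Hypothetical stand-in for a block's possible endings (not an fxos type). */
struct ToyEnding {
    bool mayStaticBranch;
    bool mayFallthrough;

    /* Static branch is the only may-ending that applies */
    bool mustStaticBranch() const
    {
        return mayStaticBranch && !mayFallthrough;
    }
    /* Fallthrough is the only may-ending that applies */
    bool mustFallthrough() const
    {
        return mayFallthrough && !mayStaticBranch;
    }
};
```

A block ending in bra has {true, false} and must static-branch, and a block without a terminator has {false, true} and must fall through. A block ending in bf or bt has {true, true}, so neither must-function holds.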

Navigation in the CFG mostly consists of querying the block's successors (ie. which blocks this block can statically branch to) and its predecessors (which blocks statically branch to it). The number of successors and predecessors can be found with .successorCount() and .predecessorCount(), and the blocks themselves can be obtained with a few different views:

/* Get blocks by reference: .successors(), .predecessors() */
for(BasicBlock &succ: bb.successors()) {
    /* ... */
}
/* Get their addresses: .successorsByAddress(), .predecessorsByAddress() */
for(u32 address: bb.successorsByAddress()) {
    /* ... */
}
/* Get their indices within the shared parent function: .successorsByIndex(),
   .predecessorsByIndex() */
for(uint index: bb.successorsByIndex()) {
    /* ... */
}

The methods .isEntryBlock() and .isTerminator() identify, respectively, the function's entry block and the blocks that return (the second is equivalent to .mustReturn()).
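A typical use of successor indices is a worklist traversal from the entry block. The sketch below runs on a hypothetical adjacency list (ToyCfg, where cfg[i] plays the role of block i's .successorsByIndex()) instead of real fxos objects. It assumes that every index is valid and that fallthrough edges are listed among the successors.

```cpp
#include <cstddef>
#include <vector>

/* Hypothetical CFG: cfg[i] lists the successor indices of block i. */
using ToyCfg = std::vector<std::vector<std::size_t>>;

/* Marks every block reachable from `entry` by following successor edges. */
std::vector<bool> reachableFrom(const ToyCfg &cfg, std::size_t entry)
{
    std::vector<bool> seen(cfg.size(), false);
    std::vector<std::size_t> worklist;

    if(entry < cfg.size()) {
        seen[entry] = true;
        worklist.push_back(entry);
    }
    while(!worklist.empty()) {
        std::size_t bb = worklist.back();
        worklist.pop_back();
        for(std::size_t succ: cfg[bb]) {
            /* Visit each block once, even when several edges lead to it */
            if(!seen[succ]) {
                seen[succ] = true;
                worklist.push_back(succ);
            }
        }
    }
    return seen;
}
```

On the if/else example (block 0 leading to blocks 1 and 2, both leading to block 3), all four blocks are marked. A block that no edge leads to from the entry stays unmarked.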

The Instruction structure

The Instruction structure represents a single instruction, within the context of a function. The basic block, function and binary owning it can be queried with .parentBlock(), .parentFunction() and .parentBinary().

This structure is instantiated in RAM for every single instruction registered as part of a function (on the order of several million for a standard OS binary), so it keeps a minimal number of attributes. In particular, analysis results are not stored here; they are stored at the binary/function level instead (as appropriate) using compact formats when needed.

The instruction's opcode can be accessed with .opcode(), and this can be used to check if the instruction is a branch, a memory access, what its operands are, etc. Instruction itself only tracks the context and provides access to analysis results. Along with the opcode, .size() will give the instruction's size in bytes (which is usually 2 but can be 4 for DSP instructions).

The instruction has its own .address() and its relationship to other instructions in its block can be found with .indexInBlock(), .isFirstInBlock(), .isLastInBlock() and .isInDelaySlot(). Note again that, because of delay slots, being the last instruction of a block and being a jump are not the same thing.

Function analysis

TODO:

  • Function prototypes
  • References
  • Cross-references
  • Dominators and post-dominators

TODO: Abstract interpretation info