Functions

Probably the most common object of interest is code. In fxos, the “proper” way to deal with code is through functions. Stray instructions that are not tied to a function get much less support from tooling and analysis.

This document describes the structures defined in:

#include <fxos/function.h>

Navigating functions, basic blocks and instructions

Functions in fxos are stored as Control Flow Graphs (CFGs), a format that is convenient for analysis. The function itself is split into basic blocks, each consisting of a straight series of instructions terminated by an explicit or implicit jump to one or two other blocks. In essence, a basic block is the largest unit of sequential code that you can find in a function.

The Function structure

The Function structure is a BinaryObject, so it always lives in a binary that can be found with .parentBinary(). It also has the usual .address(), .size(), and name/comments.

When iterated with .begin()/.end(), it produces references to its basic blocks in an arbitrary order:

for(BasicBlock &bb: function) {
    /* ... */
}

Blocks are numbered from 0 to .blockCount() - 1 and can be accessed individually with .basicBlockByIndex(). The function's entry block can be found with .entryBlock().
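As a rough illustration of index-based access, the sketch below walks the blocks of a function by index instead of by iterator. The Function and BasicBlock types here are hypothetical stand-ins written for this example, not the real fxos classes; only the accessor names (.blockCount(), .basicBlockByIndex(), .entryBlock(), .address()) come from the text above, and their exact signatures are assumptions. The helper lowestBlockAddress() is likewise made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins, NOT the real fxos classes: they only mimic the
// accessors named in the text so that the sketch compiles on its own.
struct BasicBlock {
    unsigned addr;
    unsigned address() const { return addr; }
};

struct Function {
    std::vector<BasicBlock> blocks;
    std::size_t entryIndex = 0;

    std::size_t blockCount() const { return blocks.size(); }
    BasicBlock &basicBlockByIndex(std::size_t i) {
        assert(i < blocks.size()); // assumed valid range: 0 .. blockCount()-1
        return blocks[i];
    }
    BasicBlock &entryBlock() { return basicBlockByIndex(entryIndex); }
    std::vector<BasicBlock>::iterator begin() { return blocks.begin(); }
    std::vector<BasicBlock>::iterator end() { return blocks.end(); }
};

// Hypothetical helper: lowest block address, visiting blocks by index.
unsigned lowestBlockAddress(Function &f) {
    assert(f.blockCount() > 0);
    unsigned lowest = f.basicBlockByIndex(0).address();
    for(std::size_t i = 1; i < f.blockCount(); i++) {
        unsigned a = f.basicBlockByIndex(i).address();
        if(a < lowest) lowest = a;
    }
    return lowest;
}
```

Since iteration order is arbitrary, code that needs blocks in address order has to sort or search explicitly, as this helper does.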

The BasicBlock structure

The BasicBlock structure represents a node in the CFG. It always exists in the context of a function, which can be found with .parentFunction(). The binary that owns the function is also available as .parentBinary().

The block has its own .address() and .instructionCount(). Its main attraction is the list of instructions that it contains, which can be iterated over with .begin()/.end() or in reverse order with .rbegin()/.rend():

for(Instruction &insn: bb) {
    /* ... */
}
for(auto it = bb.rbegin(); it != bb.rend(); it++) {
    Instruction &insn = *it;
    /* ... */
}

Individual instructions can also be found with .instructionAtIndex().

Basic blocks usually end with a jump instruction; however, in some cases the next block follows the current one in memory, so there is no "jump"; control just keeps going forward. For instance, in an if/else statement the .false block might fall through to whatever code follows the condition:

         ╒═══════════╕
         │ condition │     true
         │ bf .false │───────────────╮
         ╘═══════════╛         ╒═══════════╕
         false │        .true: │ ...       │
               ↓               │ bra .end  │
         ╒═══════════╕         ╘═══════════╛
 .false: │ ...       │               │
         ╘═══════════╛               │
  fall-through ↓                     │
         ╒═══════════╕               │
   .end: │ ...       │←──────────────╯
         ╘═══════════╛

In this case .false has no jump. (It still wouldn't be possible to merge .false and .end because then .true would jump into the middle of a block, which is forbidden.)

The function .hasFallthrough() will indicate whether the block falls through. If it doesn't, .terminatorInstruction() will return a pointer to the jump that terminates it (if it does, .terminatorInstruction() will return nullptr).

A major detail of the SuperH ISA is that most branch instructions have delay slots. This means that even though a basic block conceptually ends when a jump is executed, typically the instruction following the jump instruction (which the CPU executes on-the-fly during the jump) is also part of the block. Hence, the block terminator is either the last or the second-to-last instruction in the block. The function .hasDelaySlot() will indicate whether the block has a delay slot.
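The rule above (no terminator on fall-through; otherwise the last instruction, or the second-to-last one when a delay slot follows) can be sketched as follows. The types are hypothetical stand-ins written for this example, with a plain mnemonic string in place of a real opcode; they are not the fxos implementation, and only the accessor names are taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-ins, not the real fxos classes.
struct Instruction {
    std::string mnemonic;
};

struct BasicBlock {
    std::vector<Instruction> insns;
    bool fallthrough = false;
    bool delaySlot = false;

    std::size_t instructionCount() const { return insns.size(); }
    bool hasFallthrough() const { return fallthrough; }
    bool hasDelaySlot() const { return delaySlot; }

    // nullptr on fall-through; otherwise the last instruction, or the
    // second-to-last one when the block ends with a delay slot.
    Instruction *terminatorInstruction() {
        if(hasFallthrough()) return nullptr;
        std::size_t back = hasDelaySlot() ? 2 : 1;
        assert(insns.size() >= back);
        return &insns[insns.size() - back];
    }
};
```

For a block made of mov, bra, nop with a delay slot, this returns the bra; for a block ending in bf (which has no delay slot on SuperH), it returns the final instruction.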

Navigation in the CFG can be done by querying the block's .successors() and .predecessors() (both functions return read-only vectors of pointers to other blocks). Additional, hopefully self-explanatory information is available through .successorCount(), .predecessorCount(), .isEntryBlock() and .isTerminator().
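As an example of CFG navigation, here is a minimal sketch of a depth-first walk that collects every block reachable from a starting block by following .successors(). The BasicBlock type is a hypothetical stand-in (the real class carries much more), reachableFrom() is a name invented for this example, and the return type of .successors() is assumed from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in, not the real fxos class.
struct BasicBlock {
    std::vector<BasicBlock *> succ;

    std::vector<BasicBlock *> const &successors() const { return succ; }
    std::size_t successorCount() const { return succ.size(); }
};

// Collect every block reachable from `entry` with an iterative
// depth-first walk over successor edges.
std::set<BasicBlock const *> reachableFrom(BasicBlock const &entry) {
    std::set<BasicBlock const *> seen;
    std::vector<BasicBlock const *> stack;
    stack.push_back(&entry);

    while(!stack.empty()) {
        BasicBlock const *bb = stack.back();
        stack.pop_back();
        if(!seen.insert(bb).second) continue; // already visited
        for(BasicBlock *next: bb->successors()) {
            assert(next != nullptr);
            stack.push_back(next);
        }
    }
    return seen;
}
```

The visited set is what keeps the walk finite on loops, where a block is its own (possibly indirect) successor.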

The Instruction structure

The Instruction structure represents a single instruction, within the context of a function. The basic block, function and binary owning it can be queried with .parentBlock(), .parentFunction() and .parentBinary().

This structure is instantiated in RAM for every single instruction registered as part of a function (on the order of several million for a standard OS binary), so it keeps a minimal number of attributes. In particular, analysis results are not stored here; they are instead queried from the binary as annotations.

The instruction's opcode can be accessed with .opcode(), and this can be used to check if the instruction is a branch, a memory access, what its operands are, etc. The Instruction structure itself only tracks the instruction's context; what the instruction does is described by the opcode. Along with the opcode, .size() will give the instruction's size in bytes (which is usually 2 but can be 4 for DSP instructions).
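Because sizes vary between 2 and 4 bytes, byte lengths cannot be derived from instruction counts alone. The sketch below sums .size() over a run of instructions and checks that each one starts where the previous one ends. The Instruction type is a hypothetical stand-in with only the two accessors used here, the helper names are invented, and the contiguity check reflects an assumption about how instructions of a block are laid out rather than a documented fxos guarantee.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in, not the real fxos class.
struct Instruction {
    std::uint32_t addr;
    unsigned bytes; // 2 for most instructions, 4 for DSP ones

    std::uint32_t address() const { return addr; }
    unsigned size() const { return bytes; }
};

// Total byte length of a run of instructions.
unsigned byteLength(std::vector<Instruction> const &insns) {
    unsigned total = 0;
    for(Instruction const &insn: insns) {
        assert(insn.size() == 2 || insn.size() == 4);
        total += insn.size();
    }
    return total;
}

// Check that each instruction starts where the previous one ends.
bool isContiguous(std::vector<Instruction> const &insns) {
    for(std::size_t i = 1; i < insns.size(); i++) {
        if(insns[i].address() != insns[i-1].address() + insns[i-1].size())
            return false;
    }
    return true;
}
```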

The instruction has its own .address(), and its position relative to other instructions in its block can be found with .indexInBlock(), .isFirstInBlock(), .isLastInBlock() and .isInDelaySlot(). Note again that, due to delay slots, being last and being a jump are not the same thing. However, since only jumps have delay slots and jumps are always block terminators, being in a delay slot does imply being the last instruction in a block.

Function analysis

TODO:

  • Function prototypes
  • References
  • Cross-references
  • Dominators and post-dominators