Future changes to data model for future expansion #14

Open
opened 2023-08-26 15:04:57 +02:00 by Lephenixnoir · 22 comments
Owner

Cc @Dr-Carlos for potential comments.

Recently I've been looking back at fxos to see if I can reverse-engineer (and automate the detection of) the address and size of the Python heap on the G-III. Knowing these values is nothing really new, since Heath already looked at decompilations of the relevant syscalls. The novelty is that I want an automated disassembly process so I can get this information on-calc dynamically.

To do this in fxos, it'd be pretty helpful to finally have function exploration, cross-references, and hence a static analysis pass. So I think it's time to address some issues with the data model and improve it to support all of these. (To be clear, most of these issues exist because I've rarely had a clear idea of where fxos was going precisely, so I didn't anticipate the needs in advance.)

Current issues:

  1. Disassembly is a bit of a mess because it contains (through Instruction and Argument) metadata for all instructions and all arguments, which is way too much data. For instance, there is an int syscall_id for every single argument even though very few of them use it. When the data in question becomes SSA formulas at every control point, this is no longer going to work.
  2. The disassembly works on a per-instruction basis, which is too low-level. For any serious analysis, the main interface should be function-based or variable-based. This is not incompatible with the current flow, since we could also define an address range to be a variable of type "array of instructions" and look at it even if it's not qualified as a function.
  3. The current idea of "symbol" corresponds to this notion of variable/function; however, defining them as either addresses or syscall numbers is a bit confusing and leads to aliasing/sorting difficulties.
  4. There's no way to "save progress" when disassembling things. Startup scripts will not scale to eg. saving the results of static analyses.

The proposed solution is to improve the data model by adopting the following structure. First, the data itself is organized in two "layers":

  • Loading layer: used to emulate program memory within the virtual address space. This would rely on VirtualSpace, MemoryArea and MemoryRegion exactly like it does currently.
  • Program layer: used to reconstruct the parts of the program/OS by marking out variables, functions, data tables, etc. The main types would be as follows (see the code sketch after this list):
    • Binary: sort of like the current Disassembly, this object would serve as the entry point into the code. But it would function more in terms of variables/functions and include analysis results like call graphs, references and cross-references, etc. It would also not contain any trivial info like the current Argument structure; these would instead be computed on demand;
    • Function, BasicBlock, Instruction: for describing functions, their prototypes, calling conventions, etc, and of course organizing the instructions. This is intended to be the main entry point for analysis (instead of going straight through to instructions loaded at random addresses). The Instruction structure would be generated on-demand but backed by a system of annotations to avoid massive empty storage.
    • Variable: for global variables, data tables, etc. Would carry types eventually.
    • A somewhat internal annotation type, to store instruction-level info: flags, static analysis results, this sort of thing.
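
To make this concrete, here's a rough sketch of what the program layer could look like. All names and fields below are illustrative placeholders, not the final API:

```cpp
// Rough sketch of the proposed program layer. Names and fields are
// illustrative placeholders, not the actual fxos API.
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Annotation {              // instruction-level info: flags, analysis results
    int type;
    uint32_t value;
};

struct Variable {                // global variables, data tables, etc.
    uint32_t address, size;
    std::string name;            // type information would come later
};

struct BasicBlock {
    uint32_t address;
    std::vector<uint32_t> successors;   // addresses of follow-up blocks
};

struct Function {                // main entry point for analysis
    uint32_t entry;              // a virtual address, not a syscall number
    std::string name;
    std::vector<BasicBlock> blocks;
};

struct Binary {                  // replaces Disassembly as the entry point
    std::map<uint32_t, Function> functions;    // sorted by virtual address
    std::map<uint32_t, Variable> variables;
    // Sparse annotations; Instruction objects are generated on demand.
    std::map<std::pair<uint32_t, int>, Annotation> annotations;
};
```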

One of the points is that removing all the unneeded auto-generated data from Disassembly means the Binary structure will be a lot leaner, which is a lot more practical to properly serialize and save to a file. This paves the way for project saving later on.

Also, the Binary structure should probably work exclusively with virtual addresses. Having functions whose definition is a syscall number is not very practical for storage (think of a sorted map address: function). Besides, we sometimes have automated processes that find functions in more complicated ways (eg. the main menu function). I'm thinking it would be easier to drop the current "symbols" index and instead use the syscall table to automatically name functions in ads or something similar.
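
For illustration, auto-naming from the syscall table could look roughly like this; the read callback and the naming scheme are assumptions, not actual fxos code:

```cpp
// Hedged sketch: derive function names from the syscall table so the index
// stays keyed by virtual address. The read callback and naming scheme are
// assumptions for illustration only.
#include <cstdint>
#include <cstdio>
#include <functional>
#include <map>
#include <string>

void name_syscalls(std::map<uint32_t, std::string> &names,
                   const std::function<uint32_t(uint32_t)> &read_u32,
                   uint32_t table_addr, unsigned count) {
    for (unsigned id = 0; id < count; id++) {
        // Each table entry holds the virtual address of the syscall's code.
        uint32_t entry = read_u32(table_addr + 4 * id);
        char name[32];
        std::snprintf(name, sizeof name, "syscall_%04x", id);
        names.emplace(entry, name);     // keyed by address, not by syscall id
    }
}
```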

I'll start working on this now; input is welcome. If I forgot anything I'll add it here. I'm hoping to implement a first static analysis soon to get these jsr references and build function call graphs and cross-references.

Collaborator

Seems like a lot of work, but definitely the right way to go. My only worry is that accessing Instructions on-the-go might be slow, but I think we'll just have to see how it goes.

It might be useful to have some way of serialising Instructions with the Binary structure, to improve speed and allow sharing static analysis without having to share the whole ROM - but this can probably be implemented later once there's a better idea of how the structures will work internally.

Author
Owner

Thank you! Currently instructions are mapped in Disassembly through

std::map<uint32_t, Instruction> m_instructions;

Note that the act of "disassembling" itself is instant because we have a table of all 65536 non-DSP instructions. All this stores is flags like whether the instruction is in a delay slot.

In my proposed change, the Instruction would be built on-demand, which consists of:

  1. Reading 16 bits from the VirtualSpace (can be a simple pointer access for any function that decodes in bulk);
  2. Indexing the 65536-entry opcode array with the instruction's opcode (instant);
  3. Getting the annotations (analysis results) from an annotation map kind of like
/* {address, annot_type} -> annot */
std::map<std::pair<uint32_t, int>, Annotation>

Since all annotations for one instruction would be contiguous in the map, this should have the same algorithmic complexity as the current setup.
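
Concretely, fetching every annotation attached to one address could be a bounded range scan on the composite key, along these lines (an illustrative sketch, not the actual fxos code):

```cpp
// Illustrative lookup of all annotations attached to one address, assuming
// the {address, annot_type} -> Annotation map above.
#include <climits>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

struct Annotation { int type; uint32_t value; };

using AnnotMap = std::map<std::pair<uint32_t, int>, Annotation>;

std::vector<Annotation> annotations_at(const AnnotMap &m, uint32_t address) {
    // All keys with the same address are adjacent, so this is one O(log n)
    // search followed by a short linear walk.
    auto it  = m.lower_bound({address, INT_MIN});
    auto end = m.upper_bound({address, INT_MAX});
    std::vector<Annotation> out;
    for (; it != end; ++it)
        out.push_back(it->second);
    return out;
}
```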

Regardless, I'm happy to see what happens and add a cache of Instructions if it turns out to be too slow. I think that could work well.

> It might be useful to have some way of serialising Instructions with the Binary structure, to improve speed and allow sharing static analysis without having to share the whole ROM

Ah, interesting, I didn't think of that. So far my plan for serialization is just to serialize in binary for saving projects to disk; however, I see the appeal of sharing static analysis results just like you can share a decompiled function. A text export/visualization with annotations in comments sounds like a really good idea.

Collaborator

Looks great, good to see that the access speed should be fine.

Author
Owner

Alright, I've started working on this. After testing I decided to keep permanent Instruction structures, because I realized that generating them on-the-fly means the lifetime of any Instruction structure would be basically random, so we wouldn't easily be able to reference them and pass them around. I did reduce their size (44 → 24 bytes, with options to drop further if needed), so that should still help.

I also realize that having proper saveable projects (including analysis results and user annotations) will probably make the whole "fxos library" thing obsolete, since:

  • Assembly tables are pretty much constants anyway
  • Targets (ie. mappings for OS/RAM/etc files) will be saved in project files
  • Syscall symbols will only be used to auto-mark function names and prototypes when analyzing new OSes, so that's basically a data file that we could index on the repo
  • Other symbols will be saved in project files

I'm not sure it was super useful anyway... I wonder if there's any point keeping it around or if I should just remove it entirely to focus on saving projects. Do you have an opinion on this? Do you ever add things in your local fxos library at all?

Collaborator

> Alright, I've started working on this. After testing I decided to keep permanent Instruction structures, because I realized that generating them on-the-fly means the lifetime of any Instruction structure would be basically random, so we wouldn't easily be able to reference them and pass them around. I did reduce their size (44 → 24 bytes, with options to drop further if needed), so that should still help.

Sounds good.

> I also realize that having proper saveable projects (including analysis results and user annotations) will probably make the whole "fxos library" thing obsolete, since:
>
>   • Assembly tables are pretty much constants anyway
>   • Targets (ie. mappings for OS/RAM/etc files) will be saved in project files
>   • Syscall symbols will only be used to auto-mark function names and prototypes when analyzing new OSes, so that's basically a data file that we could index on the repo
>   • Other symbols will be saved in project files
>
> I'm not sure it was super useful anyway... I wonder if there's any point keeping it around or if I should just remove it entirely to focus on saving projects. Do you have an opinion on this? Do you ever add things in your local fxos library at all?

Do you mean the FXOS_HOME dir? I have a whole bunch of custom targets and symbols in my fxdoc fork (https://gitea.planet-casio.com/Dr-Carlos/fxdoc).

Author
Owner

Update time. I've done a bunch of work on this, and started replacing old abstractions with the new ones. Not everything's done, but I'm getting close.

Starting from the top, there are now projects; the documentation is here: https://gitea.planet-casio.com/Lephenixnoir/fxos/src/commit/2a3f1845dee83812368fd74c7fcf8c9016bf8bdf/doc/projects.md. A project basically carries one or more binaries (virtual address spaces with one or more files mapped in them). Each binary is typically an OS version, and carries its own analysis results. In the future the project will store cross-binary analyses (when comparing multiple OS versions).

Projects are saved to disk, which is a big improvement compared to what we had before. All the commands from the documentation are implemented, and all the convenient logic around remembering recent projects works as well.

Now, regarding how this relates to the previous way of managing data. Previously, the virtual space abstraction (vspace) did a lot of things unrelated to handling a virtual memory, such as storing instructions, symbols, etc. These responsibilities are now handled by the binary. This doesn't change much. The real difference is that binaries are saved to disk whereas vspaces used to be loaded from the fxosrc files in the FXOS_HOME folder.

In order to avoid multiple storage methods, I am looking to drop the fxosrc way of storing data. I have implemented a pm (Project Migrate) command which allows you to get a vspace defined in fxosrc and migrate it to a binary in the new project format. The command is currently incomplete; it migrates information about bindings (ie. which file to load where in memory) but doesn't currently migrate function and syscall names because functions are handled differently (read: way better) in the new data model.

You are welcome to try the development version to see if any problems arise. When booting up fxos, you will get a message "warning: Legacy fxosrc files found; use pm to migrate." and you will be able to list your old vspaces with ip and migrate them into a new project (an empty one will be created at startup) with pm. You can then name the project with pr and save it to disk with ps. Next time you open fxos, it should load automatically. You can also load it with pl.

Old commands have been ported to the new project model so they should still work; however, your function and syscall names are currently ignored, so I imagine you'll want to go back to master after testing.

From the shell there are relatively few changes. One that comes to mind is that ic (Info Claims) and is (Info Symbols) now do mostly the same thing, because symbols have become binary objects (functions, variables, and lightweight marks) which always claim a region of the binary. The new merged command is called io (Info Objects). The Info OS command is now called ios. All the v* (vspace) commands are now classified under b* (binary), but they haven't really changed. vl (Vspace List) is now classified as ib (Info Binary) and ibs (Info Binaries).

Most things happened in the API, where a lot of raw structures are being replaced by classes which can do a lot more heavy lifting for building up abstractions. The next step is to keep doing that and remove the Disassembly class, which is also getting absorbed by Binary. Then everything will be ready for the new function analyses and the fun part will commence.

Collaborator

Wow, that's a lot of changes!

I like the new interface (though it seems a bit complex to begin with), especially the new prompt. Everything seems to work so far, though I of course ran into the "Creating new objects is a TODO o(x_x)o" message.

My only criticism would be that there seems to be a lot of commands now. Here are some suggestions for removing some:

  • Is there any point having bc without bm, or should these be combined?
  • Is there much use for ib given that ibs exists? If so, should this be combined with ios?
  • It also seems like there are a lot of project commands, but I think these are all still useful, except maybe pm (but this would be removed eventually?)
Author
Owner

Thank you, I'm glad it works and even more so if you like it.

There are indeed many commands at the moment. This interface is inspired by r2/rizin, which itself has enough commands that they're not all listed in the main help message; instead they're categorized by prefix, with either sub-helps or links to the documentation. I'm expecting that with more analysis and visualization features the number of commands will inevitably increase, so maybe we should do something similar. You've seen some of the stuff in the doc/ folder; I have more and I plan to use this more effectively.

Anyway, reducing the number as we go is definitely a good idea.

  • bm adds new files (eg. RAM dumps) which you don't always think of at creation time. So the difference between bm and bc is temporal. Though I agree we could factor them into a single command with an option. These commands are rarely used anyway.
  • ib is currently useless compared to ibs. My intent is that since in the future binaries will have a lot more info, ib could print these details. I have in mind a colormap of which parts of the binary have been identified as functions/variables/marks to help categorize "this address range is the kernel" or "this is the LINK application" or stuff like this.
  • Project commands are admittedly not used very often, especially pm and pr. They have to be available, but I agree with the thought that it's not ideal to have them listed all the time.

Additionally, I think we could also get rid of:

  • . since it's only used in fxosrc (and I'm emphasizing scripting using the C++ API instead, which is a lot more powerful)
  • .dt as we can just embed disassembly tables in fxosrc (there aren't gonna be any extensions let's be real)
  • if is currently useless; I think I will replace it later with a command called vf (view/visualization function) that will print CFGs. I am thinking of using v as a category for complex visualization tasks (whereas i is quick basic info).

I feel like this is mostly a presentation problem. Do you think a better command listing would help (including not listing rare commands all the time) or is it fundamentally an issue with the number of actions available to the user?

Collaborator

> I'm expecting that with more analysis and visualization features the number of commands will inevitably increase, so maybe we should do something similar.

Yeah, that seems like a good way to go. At the moment the only things I wouldn't include in the main help message would be .dt, pm, ib and pr, though of course you may have different ideas.

Sub-help for each category would be useful in addition to the quick reference.

>   • bm adds new files (eg. RAM dumps) which you don't always think of at creation time. So the difference between bm and bc is temporal. Though I agree we could factor them into a single command with an option. These commands are rarely used anyway.

True. Maybe bc could be removed, with bm taking over this job if the specified binary doesn't exist?

>   • ib is currently useless compared to ibs. My intent is that since in the future binaries will have a lot more info, ib could print these details. I have in mind a colormap of which parts of the binary have been identified as functions/variables/marks to help categorize "this address range is the kernel" or "this is the LINK application" or stuff like this.

Sounds great. Maybe the ios info could be under here as well (I feel like these commands duplicate each other for much of the time).

>   • . since it's only used in fxosrc (and I'm emphasizing scripting using the C++ API instead, which is a lot more powerful)

Now that everything is set up once and not recreated on startup, this could definitely be removed.

>   • .dt as we can just embed disassembly tables in fxosrc (there aren't gonna be any extensions let's be real)

Yep, sounds good.

>   • if is currently useless; I think I will replace it later with a command called vf (view/visualization function) that will print CFGs. I am thinking of using v as a category for complex visualization tasks (whereas i is quick basic info).

Also sounds good. Eventually some kind of call graph (e.g. vg) could fit under this category too.

> I feel like this is mostly a presentation problem. Do you think a better command listing would help (including not listing rare commands all the time) or is it fundamentally an issue with the number of actions available to the user?

Sort of answered this already, but both are needed. The removal and simplification of unnecessary commands is more of an ongoing thing, but presentation is important too.

Author
Owner

I looked at it more and consolidating these commands was a really good idea. I finally figured out how to have unix-like options in the parser (such as -s, -b binary, this sort of thing), which I had given up on before. This is much more natural than the name=value syntax, so I played with it here.

So far, I've merged ibs into a -a (all) option to ib. My ideas for future deeper info on binaries like the heatmap can be enabled via a switch, so you can have eg. ib -h which is detailed and ib -a which is short and covers all binaries.

I've also tried to merge bc into bm, but it turns out the parameters to bm don't really work because it doesn't even mention a binary's name. I found that using options works better if the parameters are similar. So instead I merged bc (Binary Create) and brm (Binary Remove) into options for bs (Binary Select). This is inspired by git's checkout -b option, where you checkout (select) a branch (binary) while also creating it at the same time. I've also added an extra option for renaming the current binary.

I think I prefer ios to be separated from ib because an OS is supposed to be a subtype of binaries; in principle you could have a binary that's a g1a add-in, or a g3a add-in, or something similar. However, I wonder if we could merge isc into ios. What's the specific use case for isc? I haven't used it much because the full list is way too long and I rarely search individual addresses, since the disassembly already shows them.

I'll have future updates, hopefully once the new CFG pass creates actual functions it's gonna be a whole lot more fun.

> Also sounds good. Eventually some kind of call graph (e.g. vg) could fit under this category too.

Yes, exactly! That's one of the things I'm looking forward to! :3

Collaborator

These all seem like great changes!

> I think I prefer ios to be separated from ib because an OS is supposed to be a subtype of binaries; in principle you could have a binary that's a g1a add-in, or a g3a add-in, or something similar. However, I wonder if we could merge isc into ios. What's the specific use case for isc? I haven't used it much because the full list is way too long and I rarely search individual addresses, since the disassembly already shows them.

I usually use isc for scripting, but it does make sense to merge it into ios - maybe ios -s to list syscalls.

Author
Owner

Ok, so there is important news! Sorry for not posting here for so long. I've lost track of all the individual changes, but I made it to removing almost all of the old abstractions. We now have the unit of work, which is the function.

Extensive documentation about functions is available here: 9b817fe808/doc/functions.md

By now I can reconstruct functions from their CFG. There is also a new printer, which doesn't look too different from the outside:

casiowin|cg_3.60> d %25
fun.8002e332 (%0025):
  bb.8002e1ac (%0024):
    8002e1ac:  d20e   mov.l   0x80385178, r2
    8002e1ae:  e604   mov     #4, r6
    8002e1b0:  d40e   mov.l   0x8c04cf0c, r4
    8002e1b2:  e500   mov     #0, r5
    8002e1b4:  422b   jmp     @r2
    8002e1b6:  4618   shll8   r6

  bb.8002e332:
    8002e332:  4f22   sts.l   pr, @-r15
    8002e334:  d35e   mov.l   %0006, r3
    8002e336:  430b   jsr     @r3
    8002e338:  0009   nop
    8002e33a:  d35e   mov.l   %0008, r3
    8002e33c:  430b   jsr     @r3
    8002e33e:  0009   nop
    8002e340:  d35d   mov.l   %0011, r3
    8002e342:  430b   jsr     @r3
    8002e344:  0009   nop
    8002e346:  af31   bra     <8002e1ac>
    8002e348:  4f26   lds.l   @r15+, pr

However the true fun news is that I've finally added a static analysis pass! It's pretty stupid at the moment; it only understands moves of different kinds. But it's enough in almost all cases to find the addresses of functions being called.

casiowin|cg_3.60> d -a %25
fun.8002e332 (%0025):
  bb.8002e1ac (%0024):
    mov.l   0x80385178, r2
    mov     #4, r6
    mov.l   0x8c04cf0c, r4
    mov     #0, r5
    jmp     @r2                 # 0x80385178
    shll8   r6

  bb.8002e332:
    sts.l   pr, @-r15
    mov.l   %0006, r3
    jsr     @r3                 # 0x8002c7c0
    nop
    mov.l   %0008, r3
    jsr     @r3                 # 0x8002c834
    nop
    mov.l   %0011, r3
    jsr     @r3                 # 0x8002cc10
    nop
    bra     <8002e1ac>
    lds.l   @r15+, pr

I had to look for a short example so it'd fit in the message, so this case is really dumb. But there are other functions (such as %10 in this OS) where the registers used in jsr are loaded waaay ahead of time in a different part of the function, and in this case the analysis tracks down whether values change or not during execution.
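
For intuition, the core of such a pass is just tracking which constant each register is known to hold. A toy, single-block version could look like this (purely illustrative, not the actual fxos analysis code):

```cpp
// Toy sketch of resolving jsr/jmp targets by tracking register moves inside
// one basic block. Illustrative only; not the actual fxos analysis.
#include <cstdint>
#include <map>
#include <optional>
#include <utility>
#include <vector>

struct Insn {
    enum Kind { MOV_CONST, CALL, OTHER } kind;
    int reg;            // destination of the move, or register used by the call (0..15)
    uint32_t value;     // immediate or value read from the literal pool
};

std::map<uint32_t, uint32_t>   // call site address -> resolved callee address
resolve_calls(const std::vector<std::pair<uint32_t, Insn>> &block) {
    std::optional<uint32_t> regs[16];   // known constant held by each register
    std::map<uint32_t, uint32_t> calls;

    for (const auto &[addr, i] : block) {
        switch (i.kind) {
        case Insn::MOV_CONST:           // mov #imm, rn / mov.l @(disp,pc), rn
            regs[i.reg] = i.value;
            break;
        case Insn::CALL:                // jsr @rn / jmp @rn
            if (regs[i.reg])
                calls[addr] = *regs[i.reg];
            break;
        case Insn::OTHER:               // anything else clobbers its destination
            if (i.reg >= 0)
                regs[i.reg].reset();
            break;
        }
    }
    return calls;
}
```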

Notice the -a flag (analysis) which adds these comments and removes opcodes for brevity. The idea is you rise in abstraction and should no longer have to care about instructions' addresses and other similar details. There are other options such as -v (verbose) to increase the amount of analysis info printed with the code.

I'm about to add an iterator over called functions and call xrefs, and from there exploring binaries should be a lot easier.

Collaborator

Great, thanks for all the progress! Static analysis will be really useful.

Sorry for not responding earlier, I've been rather busy IRL.

I've tried to have a look at the features in my local copy of fxos, but I'm unable to disassemble:

cg_3.60|fxcg50_au> ios
OS: type CG version 03.60.0000

Header information:
  Bootcode timestamp (DateA)                 (0x8001ffb0)  :  2017.0106.2008
  Bootcode checksum                          (0x8001fffc)  :  0x00c29090
  Serial number                              (0x8001ffd0)  :  ��������
  OS version                                 (0x80020020)  :  03.60.0000

Footer information:
  Detected footer address                                  :  0x80b5feb8
  Langdata entries found                                   :  6
  OS date (DateO)                            (0x80b5ffe0)  :  2021.0830.0914
  OS checksum                                (0x80b5fff8)  :  0x6b0c8782
  Computed OS checksum                                     :  0x6b0c8782

Syscall information:
  Syscall table address                      (0x8002007c)  :  0x80687cb4
  Entries that point to valid memory                       :  0x1f68
  First seemingly invalid entry                            :  0x01010000
  Syscall entries outside ROM:
    (none)
cg_3.60|fxcg50_au> ib
* fxcg50_au
  0 objects (totaling 0 bytes)

  Region  Start         End         File
  ──────────────────────────────────────────────────────────────────────
  ROM_P2  0xa0000000 .. 0xa07fffff  ./fxdoc/os/cg/3.60/50.au.bin
  ROM     0x80000000 .. 0x81ffffff  ./fxdoc/os/cg/3.60/50.au.bin
cg_3.60|fxcg50_au> d %0
error: invalid instruction 8002c548: d687 in superblock
cg_3.60|fxcg50_au>

Any suggestions as to why this might be happening? I hadn't updated fxos in months so I presume it's due to my setup not working after the recent changes.

Author
Owner

Ah, it's my fault! The assembly tables aren't loaded properly. I'd planned to switch from reading them from the FXOS_PATH folder to embedding them in fxos, see 9b817fe808/lib/sh3.def and 9b817fe808/lib/sh4.def. However I hadn't realized that I'm still reading them from $FXOS_PATH/fxosrc. Due to the format update they must have failed to load on your end. I'll fix that.

Author
Owner

This should be fixed by f5ad03152d. Please get rid of .dt in your startup script and the associated assembly tables; this way there's no ambiguity anymore.

Collaborator

Great, it works now!

Forgive me for not being more familiar with the new(ish) interface, but is there a way to modify function names? It seems like I should be able to do this with af (though this feels like the wrong function for such a feature), but it doesn't seem to work.

It's great being able to see the automatically-named functions, and I can imagine the new possibilities for non-syscall functions once the static analysis improves.

I also just looked at d -v and I can see it becoming incredibly useful for beginners and when looking at more complex functions.

Thanks for all the work you're doing and I look forward to updates in the future!

Author
Owner

Aaaw thank you, you never fail to make me feel motivated about these small updates. Thanks a lot for that.

I've just now added saving for binary objects (ie. just functions for now). It doesn't actually save all individual instructions and just reanalyzes the function when you load the project. This is because I want to avoid keeping too much "state" as I'm worried about update logic. For instance, if I improve the analysis code it'd be annoying to have a mix of old and new results as you may think that the analysis failed to recognize something in a function when in reality it just hasn't run since the update. Similarly, if I save the entire call graph and then change something about a function it's a bit annoying to have to propagate and check for consistency all the time.

Currently I'm planning on saving only high-level info and user-specified info. For instance, look at %025 in my message from 4 days ago. You might have noticed that %025 is actually the second block. But it jumps into the first block (which is actually %024) and thus the analysis is confused into thinking that this is a single function, while in reality, this is really two functions with a sneaky tail call. If you specify manually that that jmp is a tail call or run an advanced function to detect it then I'm gonna save it in the project file. Of course, custom names, prototypes etc. would also be saved. By contrast, remembering every single instruction in the function is kind of useless because we can look for that again after reloading the project.
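
To give an idea of the split, the project file would keep something like the first structure below, while everything else is rebuilt by reanalysis on load (field names are hypothetical, not the actual fxos format):

```cpp
// Hypothetical split between what gets saved in the project file and what is
// recomputed from the ROM when the project is loaded.
#include <cstdint>
#include <string>
#include <vector>

struct SavedFunction {                  // serialized to the project file
    uint32_t entry;                     // virtual address of the function
    std::string user_name;              // custom name, prototype, notes...
    std::vector<uint32_t> tail_calls;   // user-confirmed tail-call sites
};

struct LoadedFunction {                 // rebuilt by reanalysis on load
    SavedFunction saved;
    // basic blocks, instructions, call graph edges, analysis annotations...
};
```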

This does mean that when you analyze all 8000 syscalls in an fx-CG OS and then reload, it's going to run function reconstruction, which is a bit slow. Currently it takes 1.3 seconds on my machine if running analysis, 0.3 seconds if not. I'll leave it like this for now as I want to focus on finding the test menu on the G-III to help with a Fugue crash issue (to be honest it's still much faster than opening a Ghidra project for now...) but I'll look for possible optimizations later. Worst case, there are two easy tricks, namely analyzing on demand and loading in a background thread; multi-threading is also an option.

Do you have any opinion on the loading speed of projects or this storage method?

> Forgive me for not being more familiar with the new(ish) interface, but is there a way to modify function names? It seems like I should be able to do this with af (though this feels like the wrong function for such a feature), but it doesn't seem to work.

Nope, there isn't yet! I think it's the last regression remaining from the versions from before this issue. I think you'd do this with a general editing command; af feels like it would remain automated.

When I get this to work (hopefully in an ergonomic way) and clean up the old abstractions fully (there are really few remaining by now), I'll consider this update sufficiently advanced for merging to main.

I don't know how clear of a picture you have of static analysis, but this is like really a game-changer level of functionality. (Not the version right now, which is very basic, but the general framework.) You can do stuff like infer function prototypes, guess stack layouts, and it's the first building block for decompilation. It's soooo fun. :D

Collaborator

> I've just now added saving for binary objects (ie. just functions for now). It doesn't actually save all individual instructions and just reanalyzes the function when you load the project. This is because I want to avoid keeping too much "state" as I'm worried about update logic. For instance, if I improve the analysis code it'd be annoying to have a mix of old and new results as you may think that the analysis failed to recognize something in a function when in reality it just hasn't run since the update.

This definitely makes sense, though I wouldn't mind an option to "export" a function so others without a copy of your ROM can analyse it. Not sure how useful this would be, it just seems like a logical addition to me.

> This does mean that when you analyze all 8000 syscalls in an fx-CG OS and then reload it's going to run function reconstruction, which is a bit slow. Currently it takes 1.3 second on my machine if running analysis, 0.3 second if not. I'll leave it like this for now as I want to focus on finding the test menu on the G-III to help with a Fugue crash issue (to be honest it's still much faster than opening a Ghidra project for now...) but I'll look for possible optimizations later. Worst case there are two easy tricks which are analyze on-demand and load in a background thread, and then multi-threading is also an option.

Multi-threading or a background thread both sound like good options. On my laptop it takes about 11 seconds to load with analysis, so having instant access to a terminal to start typing would be more efficient. Multi-threading would also be great, but I don't know how easy this is to implement with the current system. To be honest I don't know much about implementing multi-threading at all, so I can't really comment.
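
For what it's worth, the background-thread variant can be fairly small; below is a hypothetical sketch using `std::async` (not fxos's actual implementation) where the prompt becomes available immediately and the results are only waited on when something needs them.

```cpp
// Hypothetical sketch (not fxos code): run the heavy analysis pass in a
// background thread with std::async so the prompt is usable right away, and
// only block when a command actually needs the results.
#include <chrono>
#include <cstdio>
#include <future>
#include <thread>

struct AnalysisResults { int functions_found = 0; };

static AnalysisResults run_full_analysis()
{
    // Stand-in for the expensive pass (function exploration, xrefs, ...).
    std::this_thread::sleep_for(std::chrono::seconds(2));
    return AnalysisResults { 12000 };
}

int main()
{
    std::future<AnalysisResults> pending =
        std::async(std::launch::async, run_full_analysis);

    std::printf("prompt ready, analysis running in the background...\n");

    // Commands that don't need the results keep running; the first one that
    // does can either poll like this or just call pending.get() and block.
    while (pending.wait_for(std::chrono::milliseconds(0)) != std::future_status::ready)
        std::this_thread::sleep_for(std::chrono::milliseconds(200));

    std::printf("analysis done: %d functions\n", pending.get().functions_found);
    return 0;
}
```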

> > Forgive me for not being more familiar with the new(ish) interface, but is there a way to modify function names? It seems like I should be able to do this with af (though this feels like the wrong function for such a feature), but it doesn't seem to work.

> Nope there isn't yet! I think it's the last regression remaining from the versions from before this issue. I think you'd do this with a general edition command, af feels like it would remain automated.

Cool, sounds good. If you're looking for command suggestions, `on` would work for Object Name.

> I don't know how clear of a picture you have of static analysis, but this is like really a game-changer level of functionality. (Not the version right now, which is very basic, but the general framework.) You can do stuff like infer function prototypes, guess stack layouts, and it's the first building block for decompilation. It's soooo fun. :D

Yes, I'm very excited for the possibilities! It looks similar to my first steps when trying to manually decompile.

Author
Owner

> This definitely makes sense, though I wouldn't mind an option to "export" a function so others without a copy of your ROM can analyse it. Not sure how useful this would be, it just seems like a logical addition to me.

Good idea!
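
As a sketch of what such an export might contain, the record below bundles the load address, the user-given name and the raw bytes copied from the ROM. The field layout, the names and the `.fxfn` extension are all invented for the example; this is not an actual fxos format.

```cpp
// Hypothetical sketch (not an actual fxos format): export one function as a
// self-contained blob (load address, user-given name, and the raw bytes
// copied out of the ROM) so someone without the ROM can still load it.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

struct FunctionExport {
    uint32_t address;            // where the code lives in the virtual space
    std::string name;            // user-given name
    std::vector<uint8_t> bytes;  // raw instruction bytes copied from the ROM
};

static bool write_export(const FunctionExport &fn, const char *path)
{
    std::FILE *fp = std::fopen(path, "wb");
    if (!fp)
        return false;

    uint32_t name_len = static_cast<uint32_t>(fn.name.size());
    uint32_t code_len = static_cast<uint32_t>(fn.bytes.size());
    std::fwrite(&fn.address, sizeof fn.address, 1, fp);
    std::fwrite(&name_len, sizeof name_len, 1, fp);
    std::fwrite(fn.name.data(), 1, name_len, fp);
    std::fwrite(&code_len, sizeof code_len, 1, fp);
    std::fwrite(fn.bytes.data(), 1, code_len, fp);
    return std::fclose(fp) == 0;
}

int main()
{
    // Address, name and bytes below are placeholders, not real OS data.
    FunctionExport fn { 0x80358000u, "example_function", { 0x2f, 0xe6, 0x2f, 0xd6 } };
    if (write_export(fn, "example_function.fxfn"))
        std::printf("exported %zu bytes\n", fn.bytes.size());
    return 0;
}
```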

> Multi-threading or a background thread both sound like good options. On my laptop it takes about 11 seconds to load with analysis, so having instant access to a terminal to start typing would be more efficient.

Oof, 11 seconds is just so slow. Just to check, would you happen to have a debug build by any chance? It's also about that slow for me, but only for debug builds. It's still annoying in my opinion, though, so I'll probably make analysis on-demand soon, i.e. it doesn't analyze at startup but pretends all the data is available and analyzes when requested. That should be a much better experience...
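
As a rough picture of what "pretend the data is available and analyze when requested" could look like, here is a hypothetical sketch (not fxos's real classes): the result lives in an optional that gets filled the first time it is asked for.

```cpp
// Hypothetical sketch (not fxos's actual classes): a function that "pretends"
// its analysis is available but only computes it the first time it is needed.
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <optional>

struct Analysis {
    int basic_blocks = 0;  // whatever the real analysis would store
};

class Function {
public:
    explicit Function(uint32_t address): m_address(address) {}

    // Called by any command that needs analysis results; computes them lazily.
    const Analysis &analysis()
    {
        if (!m_analysis)
            m_analysis = analyze();
        return *m_analysis;
    }

private:
    Analysis analyze() const
    {
        std::printf("analyzing function at 0x%08" PRIx32 "...\n", m_address);
        return Analysis { 7 };  // placeholder result
    }

    uint32_t m_address;
    std::optional<Analysis> m_analysis;  // empty until first requested
};

int main()
{
    Function f(0x80020000);
    f.analysis();  // triggers the analysis
    f.analysis();  // already cached, nothing recomputed
    return 0;
}
```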

> Cool, sounds good. If you're looking for command suggestions, `on` would work for Object Name.

Thanks, I'll keep that in mind. I'll see if there are enough object-generic commands to make this worthwhile; otherwise I'll keep a default "edit function" + "edit variable" approach. Seeing as there might be a lot of function data to set (prototypes, calling conventions, instruction hints...), maybe that grouping makes more sense? We'll see.

Collaborator

> Oof, 11 seconds is just so slow. Just to check, would you happen to have a debug build by any chance? It's also about that slow for me, but only for debug builds. It's still annoying in my opinion, though, so I'll probably make analysis on-demand soon, i.e. it doesn't analyze at startup but pretends all the data is available and analyzes when requested. That should be a much better experience...

Turns out I did! It's quite fast in a non-debug situation, so I don't really mind, but yes, on-demand analysis would be good.

Author
Owner

Ok, I pushed an improvement to avoid having bad performance stay on the branch for too long. What I did is explore functions (ie. find which instructions are in them) when loading, but not analyze them yet; instead, analysis is computed on demand.

I also profiled function exploration and identified a few easy bottlenecks, as you'd expect from a first version not intended for performance. Fixing these improved loading time by ~30%. With both changes I can now load my `casiowin` project, which has 12000 syscalls saved as functions (8000 for a CG binary, 4000 for an FX binary), in about 400 ms. This will likely still increase in the future as we'll want to have more functions than that.

Full analysis would still block if we want to construct more complex objects such as a call graph; I'll think later about whether that's a problem, and what caching solutions there are if it is.
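
One possible shape for such a cache, sketched below with made-up names (this `Binary` is not the real fxos class): build the call graph only when it is first requested and drop the cached copy whenever a function changes, which also sidesteps the consistency worries mentioned earlier.

```cpp
// Hypothetical sketch (made-up names, not fxos code): build the call graph
// only when first requested, and drop the cached copy whenever a function is
// added or edited so stale results can never be observed.
#include <cstdint>
#include <cstdio>
#include <map>
#include <optional>
#include <vector>

using CallGraph = std::map<uint32_t, std::vector<uint32_t>>;  // caller -> callees

class Binary {
public:
    void add_function(uint32_t address)
    {
        m_functions.push_back(address);
        m_call_graph.reset();  // any change invalidates the cached graph
    }

    const CallGraph &call_graph()
    {
        if (!m_call_graph)
            m_call_graph = build_call_graph();  // only built on demand
        return *m_call_graph;
    }

private:
    CallGraph build_call_graph() const
    {
        std::printf("building call graph for %zu functions...\n", m_functions.size());
        CallGraph g;
        for (uint32_t addr : m_functions)
            g[addr] = {};  // real code would scan each function for call targets
        return g;
    }

    std::vector<uint32_t> m_functions;
    std::optional<CallGraph> m_call_graph;
};

int main()
{
    Binary b;
    b.add_function(0x80020000);
    b.add_function(0x80020100);
    b.call_graph();  // built here
    b.call_graph();  // cached, free
    b.add_function(0x80020200);
    b.call_graph();  // rebuilt after the invalidation
    return 0;
}
```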

Collaborator

Great, thank you! The load time is about 220 ms for me (just a CG binary), which is barely noticeable.
