python binary offset aho-corasick

How to get the same offset between readelf/IDA and Aho-Corasick in a binary


First of all I'm new to working with binaries and hope this is not a stupid question.

I have generated tables with sequences of instructions from the .text section of a binary. A table with 2-instruction sequences looks like this:

sequence         | total | relative
------------------------------------
e3a0b000e3a0e000 | 2437  |  0.0469
...

The sequences were extracted using IDAPython. The generated text files look like this:

9c54    SUBROUTINE
9c54    e3a0b000    MOV             R11, #0
9c58    e3a0e000    MOV             LR, #0
...

UPDATED

Now I'm using the Aho-Corasick algorithm to match these sequences in the same binary from which I extracted them. I just add all sequences from the table to the Aho automaton:

import binascii

import ahocorasick

from connect_db import DB
from get_metadata import get_meta

a = ahocorasick.Automaton()
meta = get_meta()
with DB('test.db') as db:
    for idx, key in enumerate(list(db.select_query(meta['select_queries']['select_all'].format('sequence_two')))):
        a.add_word(key[0], (idx, key[0]))

a.make_automaton()
with open('../test/test_binary', 'rb') as f:
    for sub in a.iter(f.read().hex()):
        print('file offset: %s; length: %d; sequence: %s' % (hex(sub[0]), len(sub[1][1]), sub[1][1]))

Then I get the following output:

file offset: 0x38b7; length: 16; sequence: e3a0b000e3a0e000
...

My problem is that Aho-Corasick returns 0x38b7. When I looked into the binary again with ghex on Ubuntu, I found the two instructions at the expected offset:

offset:  bytes:
00001C54 E3A0B000 E3A0E000 ...

This means I should find them in the range 0x1c54 to 0x1c5c, which is the raw file offset (0x9c54 - 0x8000 = 0x1c54).
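The subtraction above can be written as a small helper. This is only a sketch: `va_to_file_offset` and `load_base` are hypothetical names, and the value 0x8000 is simply the difference used in this post. For other binaries the difference would have to be read from the section headers (for example the address and offset columns of `readelf -S`).

```python
def va_to_file_offset(va, load_base=0x8000):
    """Translate a virtual address (as shown by IDA) into a raw file offset.

    load_base is the difference between virtual address and file offset.
    0x8000 is the value from this post, not a general constant.
    """
    return va - load_base


print(hex(va_to_file_offset(0x9c54)))  # 0x1c54
```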

I have not yet understood how to arrive at the same offset, but I'd like to get the raw file offset from Aho-Corasick. I know that Aho-Corasick returns the offset of the end of the keyword, not its start.


Solution

  • I fixed the problem once I realized that converting the bytes to ASCII hex with f.read().hex() produces two characters per byte, so the automaton reports positions in the hex string rather than in the file. I had to halve the offset returned by Aho-Corasick to get the raw file offset:

    BEFORE

    with open('../test/test_binary', 'rb') as f:
        for sub in a.iter(f.read().hex()):
            print('file offset: %s; length: %d; sequence: %s' % (hex(sub[0]), len(sub[1][1]), sub[1][1]))
    

    AFTER

    with open('../test/test_binary', 'rb') as f:
        for sub in a.iter(f.read().hex()):
            print('file offset: %s; length: %d; sequence: %s' % (hex(int(sub[0] / 2)), len(sub[1][1]), sub[1][1]))
    

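    The new output is almost as expected. The reported offset, 0x1c5b, is the last byte of the match rather than the first, because Aho-Corasick returns the end position of the keyword: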

    file offset: 0x1c5b; length: 16; sequence: e3a0b000e3a0e000
    

    NOTE

    In Python 3, dividing the offset with / always produces a float. Converting that float back with int() truncates toward zero rather than rounding, so an odd index loses its half. Integer division (sub[0] // 2) gives the same result and avoids the float altogether.
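To get from the end index to the start offset 0x1c54 that ghex shows, the length of the keyword has to be subtracted before halving. The following is a minimal sketch, assuming the end index is inclusive and counts hex characters, as in the output above. `hex_end_to_byte_span` is a hypothetical helper, not part of the ahocorasick module. It also rejects matches that begin on an odd hex character, since those start in the middle of a byte and cannot be real instruction matches.

```python
def hex_end_to_byte_span(end_index, hex_len):
    """Map an inclusive end index in a hex string to byte offsets.

    Returns (start, end) in bytes, with end exclusive, or None when the
    match starts on an odd hex character (i.e. in the middle of a byte).
    """
    start_index = end_index - hex_len + 1  # first hex character of the match
    if start_index % 2 != 0:
        return None  # not byte-aligned
    start = start_index // 2  # two hex characters per byte
    return start, start + hex_len // 2


span = hex_end_to_byte_span(0x38b7, 16)
print([hex(x) for x in span])  # ['0x1c54', '0x1c5c']
```

Inside the loop above this would be called as hex_end_to_byte_span(sub[0], len(sub[1][1])), skipping any result that is None.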