python sql algorithm search search-engine

Search Algorithm for Imperfect Matches in Series of Rows

I'm working on demonstrating some principles of malware scanning (I don't know if this is a "traditional" method but I believe it will work to some extent, regardless, I'm trying to implement it).

At this stage I'm trying to map specific function calls (ie: networking calls) in a disassembled program and check my SQL database to see if there are any matches with known malware.

Here is what my "function calls" table looks looks like:

malwareID           address           function           order

    1               8048000            socket              1
    1               8048010            bind                2
    1               8048020            listen              3
    1               8048030            bind                4
    1               8048040            recv                5

At the moment I can only check for direct matches. I disassemble the target program, I retrieve all the network function calls and their addresses and I check for direct matches in the table (ie: SELECT malwareID WHERE address = 'addr' AND function = 'func'). This can detect some variants of a malware, but mostly doesn't work because it depends on the variant having the same function calls at the same address'.

What I would like to do instead is search for the function calls in the database by their 'order'. Let's say after disassembling and extrating function calls I end up with a list like this:

8041000     socket
8041010     bind
8041020     listen
8041030     bind
8041040     send
8041050     recv

While the addresses don't match the known variant stored in the database, the order of the calls and the "distance" between calls is the same/similar, except that a new send call was slipped in this new variant.

Since I've never done any software engineering or classes where you learn these kind of "searching" algorithms, I need pointed in the right direction.

I'm looking for an algorithm that searches a database for a match to a series of rows, perhaps taking into account 'distances' between calls as well, and allows for some tolerance (ie: additional call inserted, missing one call, one call replaced by other).

Are there any algorithms that do more or less this that I can read up on?

(Note: I'm working with Python and sqlite, but I welcome any pseudo-codish ideas on how to do this)

Solution

If I understand correctly, I think you should consider storing pairs of function calls and storing the address distance between those calls. Something like :

malwareID       functionA  distance functionB       order

1               socket    10        bind            1
1               bind      10        listen          2
1               listen    10        bind            3
1               bind      10        recv            4

This way you can do simple SQL query where can do a count(*) to check if there is enough matches to raise a malware security issue.