I'm working on demonstrating some principles of malware scanning (I don't know if this is a "traditional" method but I believe it will work to some extent, regardless, I'm trying to implement it).
At this stage I'm trying to map specific function calls (ie: networking calls) in a disassembled program and check my SQL database to see if there are any matches with known malware.
Here is what my "function calls" table looks looks like:
malwareID address function order
1 8048000 socket 1
1 8048010 bind 2
1 8048020 listen 3
1 8048030 bind 4
1 8048040 recv 5
At the moment I can only check for direct matches. I disassemble the target program, I retrieve all the network function calls and their addresses and I check for direct matches in the table (ie: SELECT malwareID WHERE address = 'addr' AND function = 'func'
). This can detect some variants of a malware, but mostly doesn't work because it depends on the variant having the same function calls at the same address'.
What I would like to do instead is search for the function calls in the database by their 'order'. Let's say after disassembling and extrating function calls I end up with a list like this:
8041000 socket
8041010 bind
8041020 listen
8041030 bind
8041040 send
8041050 recv
While the addresses don't match the known variant stored in the database, the order of the calls and the "distance" between calls is the same/similar, except that a new send
call was slipped in this new variant.
Since I've never done any software engineering or classes where you learn these kind of "searching" algorithms, I need pointed in the right direction.
I'm looking for an algorithm that searches a database for a match to a series of rows, perhaps taking into account 'distances' between calls as well, and allows for some tolerance (ie: additional call inserted, missing one call, one call replaced by other).
Are there any algorithms that do more or less this that I can read up on?
(Note: I'm working with Python and sqlite, but I welcome any pseudo-codish ideas on how to do this)
If I understand correctly, I think you should consider storing pairs of function calls and storing the address distance between those calls. Something like :
malwareID functionA distance functionB order
1 socket 10 bind 1
1 bind 10 listen 2
1 listen 10 bind 3
1 bind 10 recv 4
This way you can do simple SQL query where can do a count(*) to check if there is enough matches to raise a malware security issue.