Tags: c++, binaryfiles, binary-data, extract

Question about parsing a binary file in C++


So I got this assignment for an interview, and it seems kind of confusing.

I am given a couple of binary files (e.g. contacts, calls, etc.) and I need to extract as much information from them as I can.

I opened the binary file in Hex Fiend (picture of Hex Fiend here), and I have a picture of what the calls should look like (picture of the calls here).

My assignment is in C++, and I managed to extract information such as the phone numbers and the "TO" label, but all the other data seems to be unreadable as chars. Is it encoded as an ASCII message, such as a description of headers where the labels should be, or is it supposed to be corrupted/unreadable?

I should be able to extract the date and duration as well.

So far I have parsed the file so that any character less than or equal to 31, or greater than or equal to 127, is replaced with whitespace. That way I can see the letters and numbers that correspond to actual data, like the phone numbers.
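A minimal sketch of that filtering step, assuming the whole file has already been read into a byte vector. The function name `printable_view` is hypothetical and only used for illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Replace every byte outside the printable ASCII range (32..126) with a
// space, so that readable text stands out in the output.
std::string printable_view(const std::vector<std::uint8_t>& bytes) {
    std::string out;
    out.reserve(bytes.size());
    for (std::uint8_t b : bytes) {
        const bool printable = (b >= 32 && b <= 126);
        out.push_back(printable ? static_cast<char>(b) : ' ');
    }
    // Sanity check: exactly one output character per input byte,
    // so offsets in the text still match offsets in the file.
    assert(out.size() == bytes.size());
    return out;
}
```

Because the output keeps one character per byte, a position in the string can be compared directly with an offset in the hex viewer.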

My main idea to solve this kind of problem is to figure out the structure of the binary.

For example, 01020304 could be a header that says this is a log and that data follows.

Any ideas on how to solve the rest of the problem?

Thanks in advance!


Solution

  • This file looks like it contains fixed-length records, optionally with a header. I took the distance between two of these EFCD markers (at 0x34e and 0x3b8) and came up with 106 (or 0x6a). Try resizing your hex viewer so that each record spans a whole number of rows, for example 53 or 106 bytes per row.

    6360 bytes (the size of the file) is an exact multiple of 106, namely 60 records, so it seems like there is no header or footer.
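The arithmetic above can be checked at compile time, and the same record size can be used to cut the file into records. This is only a sketch: the constant names and `split_records` are hypothetical, and it assumes there really is no header or footer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Offsets of two consecutive EFCD markers, as read off the hex dump.
constexpr std::size_t kFirstMarker  = 0x34e;
constexpr std::size_t kSecondMarker = 0x3b8;
constexpr std::size_t kRecordSize   = kSecondMarker - kFirstMarker;

static_assert(kRecordSize == 106, "0x3b8 - 0x34e = 0x6a = 106");
static_assert(6360 % kRecordSize == 0, "6360 bytes hold a whole number of records");
static_assert(6360 / kRecordSize == 60, "6360 / 106 = 60 records");

// Split a buffer into fixed-size records.
// ASSUMPTION: the data starts with a record (no header) and has no footer.
std::vector<std::vector<std::uint8_t>> split_records(
        const std::vector<std::uint8_t>& data,
        std::size_t record_size = kRecordSize) {
    assert(record_size > 0 && data.size() % record_size == 0);
    std::vector<std::vector<std::uint8_t>> records;
    for (std::size_t pos = 0; pos < data.size(); pos += record_size) {
        const auto first = data.begin() + static_cast<std::ptrdiff_t>(pos);
        const auto last  = first + static_cast<std::ptrdiff_t>(record_size);
        records.emplace_back(first, last);
    }
    return records;
}
```

If a header does turn up later, the function would need a starting offset; the assertion on the size at least makes a wrong guess fail loudly.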

    Let's look at a record in detail. I chose the one starting at 0x1a8 because it has some text we can look at.

    • Offset 0x00: Some kind of sequence number that seems to be different for some of these markers. We do not know how big it is, so let's guess 4 bytes for now.
    • Offset 0x04: For most records this is FF00 or FF02. 2 bytes.
    • Offset 0x06: Almost always FFFF, but not always. Also 2 bytes?
    • Offset 0x0C: This vaguely looks like a timestamp. 4 bytes?
    • Offset 0x10: Finally, some text we recognize! Looks like it's in UCS-2, with all those 00 bytes in between. Assuming the field has a fixed size, it spans 0x1e2 - 0x1b8 = 42 bytes (21 UCS-2 characters).
    • Offset 0x58: The digits of the dialed number. This is probably also fixed size.
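The field guesses above could be turned into a first-draft parser for one 106-byte record. Everything here is a sketch under stated assumptions: the names (`CallRecord`, `parse_record`, and the helpers) are hypothetical, multi-byte values are assumed to be little-endian, the text is assumed to be UCS-2 LE, and the number field is assumed to run from 0x58 to the end of the record.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::size_t kRecordSize   = 106;
constexpr std::size_t kSeqOffset    = 0x00;  // 4 bytes (guess)
constexpr std::size_t kTimeOffset   = 0x0C;  // 4 bytes, possibly a timestamp
constexpr std::size_t kTextOffset   = 0x10;  // 42 bytes of UCS-2 text
constexpr std::size_t kTextSize     = 42;
constexpr std::size_t kNumberOffset = 0x58;  // digits of the dialed number

// ASSUMPTION: integers are stored little-endian.
std::uint32_t read_u32_le(const std::uint8_t* p) {
    return  static_cast<std::uint32_t>(p[0])
         | (static_cast<std::uint32_t>(p[1]) << 8)
         | (static_cast<std::uint32_t>(p[2]) << 16)
         | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Decode UCS-2 LE up to the first NUL code unit; non-ASCII units become '?'.
std::string read_ucs2_ascii(const std::uint8_t* p, std::size_t byte_count) {
    std::string out;
    for (std::size_t i = 0; i + 1 < byte_count; i += 2) {
        const auto unit = static_cast<std::uint16_t>(p[i] | (p[i + 1] << 8));
        if (unit == 0) break;
        out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
    return out;
}

// Collect dialable characters, skipping NUL bytes so that both plain ASCII
// and UCS-2 LE digits are handled; stop at the first other byte.
std::string collect_digits(const std::uint8_t* p, std::size_t byte_count) {
    std::string out;
    for (std::size_t i = 0; i < byte_count; ++i) {
        if (p[i] == 0) continue;
        const bool dialable = (p[i] >= '0' && p[i] <= '9') || p[i] == '+';
        if (!dialable) break;
        out.push_back(static_cast<char>(p[i]));
    }
    return out;
}

struct CallRecord {
    std::uint32_t sequence = 0;  // meaning still unknown
    std::uint32_t raw_time = 0;  // undecoded; epoch and unit are unknown
    std::string text;
    std::string number;
};

CallRecord parse_record(const std::vector<std::uint8_t>& rec) {
    assert(rec.size() == kRecordSize);
    CallRecord r;
    r.sequence = read_u32_le(rec.data() + kSeqOffset);
    r.raw_time = read_u32_le(rec.data() + kTimeOffset);
    r.text     = read_ucs2_ascii(rec.data() + kTextOffset, kTextSize);
    r.number   = collect_digits(rec.data() + kNumberOffset,
                                kRecordSize - kNumberOffset);
    return r;
}
```

The timestamp is deliberately left as a raw integer: comparing `raw_time` across records against the known call dates is one way to work out its epoch and unit.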

    There is some more stuff to find out, but I will leave that to you. As a final tip, use something like Kaitai Struct (http://kaitai.io/) to write a language-agnostic definition of the binary format, from which you can generate parsers in all kinds of languages.