Search code examples
c++htk

understanding format of file


I have a question regarding file reading and I am getting frustrated over it as I am doing some handwriting recognition development and the tool I am using doesn't seem to read my training data file.

So I have one file which works perfectly fine. I paste some contents of that file here:

 è      Aڈ2*A   ê“AêA mwA)àXA$NلAئ~A›إA:ozA)"ŒA%IœA&»ّAم3ACA

|®AH÷AD¢A ô-A گ&AJXAsAA mGA قQAٍALs@÷8´A

The file is in a format I know about that first 12 bytes are 2 longs and 2 shorts with most probably data as 4 , 1000 , 1024 , 9 but T cannot read the file to get these values.

Actually I want to write my first 12 bytes in format similar to the mentioned above and I dont seem to get how to do it.

Forgot to mention that the remaining data are float points. When I write data into file I get human readable text not these symbols and when I am reading these symbols I do not get the actual values. How to get the actual floats and integers across these symbols?

My code is

struct rec
{
    long a;
    long b ;
    short c;
    short d;
}; // this is the struct 

FILE *pFile;
struct rec my_record;

// then I read using fread

fread(&my_record,1,sizeof(my_record),pFile);`

and the values i get in a, b, c and d are 85991456, -402448352, 8193, and 2336 instead of the actual values.


Solution

  • First of all, you should open that file in a hex editor, to see exactly what bytes it contains. From the text excerpt you have posted I think it does not contain 4, 1000, 1024 and 9 as you expect, but text form may be very misleading, because different character encodings show different characters for the same sequences of bytes.

    If you have confirmed that the file contains the expected data, there may be still other issues. One of these is endianness, some machines and file formats encode a 4-byte long with least significant byte first, while others read and write the most significant byte first.

    Other issue concerns the long data type you use. If your computer has a 64-bit architecture and you are using Linux, long is a 64-bit value, and your structure becomes 20 bytes long instead of 12.

    Edit:

    To read big-endian longs on a litte-endian machine like yours, you should read de data byte-by-byte and build the longs from them manually:

    // Read 4 bytes
    unsigned char buf[4];
    fread(buf, 4, 1, pFile);
    // Convert to long
    my_record.a = (((long)buf[0]) << 24) | (((long)buf[1]) << 16) | (((long)buf[2]) << 8) | ((long)buf[3]);