
Read from binary file to array: Preceding arbitrary numbers


I'm trying to read from a binary file into a char array. When I print an array entry, an arbitrary number is printed first, then a newline, and then the desired number. I really can't get my head around this. The first few bytes of the file are: 00 00 08 03 00 00 EA 60 00 00 00 1C 00 00 00 1C 00 00

My Code:

  void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                          std::string labelsPath) {
  std::ifstream is(imagesPath.c_str());
  char *data = new char[12];

  is.read(data, 12);

  std::cout << std::hex  << (int)data[2] << std::endl;

  delete [] data;
  is.close();
}

E.g. it prints:

ffffff9b
8

8 is correct. The preceding number changes from execution to execution. And where does this newline come from?


Solution

  • You asked about reading data from a binary file into a char[], and you showed us this code:

      void MNISTreader::loadImagesAndLabelsToMemory(std::string imagesPath,
                              std::string labelsPath) {
      std::ifstream is(imagesPath.c_str());
      char *data = new char[12];
    
      is.read(data, 12);
    
      std::cout << std::hex  << (int)data[2] << std::endl;
    
      delete [] data;
      is.close();
    }
    

    And you wanted to know:

    The preceding number changes from execution to execution. And where does this newline come from?

    Before that question can be answered, you need to know the binary file, that is, how the file is structured internally. When you read data from a binary file, remember that some program wrote that data in a structured format, and that format is specific to each family or type of binary file. Most binary files follow a common pattern: they contain a header, possibly sub-headers, and then clusters, packets, or chunks, or simply raw data after the header; some binary files are purely raw data. You have to know how the file is laid out.

    • What is the structure of the data?
      • Is the data type of the first entry in the file a char = 1 byte, an int = 4 bytes (on most 32-bit and 64-bit platforms), a float = 4 bytes, a double = 8 bytes, etc.?

    According to your code you have an array of char with a size of 12, and since a char is 1 byte in memory you are asking for 12 bytes. The problem is that you are pulling off 12 consecutive individual bytes, and without knowing the file structure you cannot determine whether the first byte was written as a char, as an unsigned char, or as part of an int.
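    One concrete consequence of the char versus unsigned char question shows up when bytes are printed. On platforms where plain char is signed, a byte of 0x80 or above converts to a negative int, and std::hex then prints it sign-extended (for example ffffffea for the byte EA). The ffffff9b in the question has the shape that a sign-extended 0x9b byte would produce, although the posted code alone does not show where that value comes from. A minimal sketch, using a hypothetical helper named bytesToHex that is not part of the original code:

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats `count` raw bytes as two-digit hex values
// separated by single spaces.
std::string bytesToHex(const char* data, std::size_t count) {
    assert(data != nullptr);
    std::ostringstream out;
    out << std::hex << std::setfill('0');
    for (std::size_t i = 0; i < count; ++i) {
        if (i != 0) {
            out << ' ';
        }
        // Convert through unsigned char first so that bytes of 0x80 and
        // above stay in the range 0..255 instead of becoming negative.
        out << std::setw(2)
            << static_cast<int>(static_cast<unsigned char>(data[i]));
    }
    return out.str();
}
```

    Calling bytesToHex(data, 12) on the buffer from the question would show each byte as it appears in a hex editor, regardless of whether char is signed on the platform.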

    Consider these two different binary file structures, each created from C++ structs that contain all the needed data and each written out to a file in a binary format.

    A Generic Header Structure that both File Structures will use.

    struct Header {
        // Size of Header
        std::string filepath;
        std::string filename;
    
        unsigned int pathSize;
        unsigned int filenameSize;
    
        unsigned int headerSize;
        unsigned int dataSizeInBytes;
    };
    

    FileA Unique Structure For File A

    struct DataA {
        float width;
        float length;
        float height;
        float dummy; 
    };
    

    FileB Unique Structure For File B

    struct DataB {
        double length;
        double width;
    };
    

    A file in memory would in general look something like this:

    • The first few bytes are the path, the file name, and the stored sizes
      • This can vary from file to file depending on how many characters are used for the file path and the file name.
      • After the strings, the next 4 values are unsigned ints. On most 32-bit and 64-bit platforms an unsigned int is 4 bytes, so they take 4 bytes x 4 = 16 bytes in total.
      • The exact width is implementation-defined, so it helps to know which compiler and platform wrote the file.
      • Of these 4 unsigned values, the first two hold the lengths of the path and the file name. These could be the first values read in from the file, ahead of the actual strings; the order could be reversed.
      • It is the next 2 unsigned values that matter most.
      • The first of them is the full size of the header, which can be used to read in or skip over the header.
      • The second tells you the size of the data to be pulled in. The data could be stored in chunks, with a count of how many chunks there are, because it could be a series of identical data structures; for simplicity I left out chunks and counts and use a single structure instance.
      • It is here where we learn how many bytes of data to extract.
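
    The fixed-size part of such a header can be read without depending on how wide unsigned int happens to be. The following is a minimal sketch, not the layout of any real file format: HeaderSizes, writeHeaderSizes, readHeaderSizes and demoRoundTrip are hypothetical names, the fields use fixed-width std::uint32_t, and the sketch assumes the data was written with the same byte order as the machine reading it.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical fixed-size portion of the header described above.
struct HeaderSizes {
    std::uint32_t pathSize;
    std::uint32_t filenameSize;
    std::uint32_t headerSize;
    std::uint32_t dataSizeInBytes;
};

static_assert(sizeof(HeaderSizes) == 16, "four 4-byte fields, no padding");

// Writes the four size fields as raw bytes in host byte order.
void writeHeaderSizes(std::ostream& out, const HeaderSizes& sizes) {
    out.write(reinterpret_cast<const char*>(&sizes), sizeof(HeaderSizes));
}

// Reads the four size fields back. Returns false if the stream ended
// before all 16 bytes were available.
bool readHeaderSizes(std::istream& in, HeaderSizes& sizes) {
    in.read(reinterpret_cast<char*>(&sizes), sizeof(HeaderSizes));
    return in.gcount() == static_cast<std::streamsize>(sizeof(HeaderSizes));
}

// Round trip through an in-memory stream, standing in for a real file.
void demoRoundTrip() {
    const HeaderSizes original{11, 7, 34, 16};  // made-up example values
    std::stringstream buffer(std::ios::in | std::ios::out | std::ios::binary);
    writeHeaderSizes(buffer, original);

    HeaderSizes copy{};
    const bool ok = readHeaderSizes(buffer, copy);
    assert(ok && copy.dataSizeInBytes == original.dataSizeInBytes);
    (void)ok;  // silences the unused-variable warning when NDEBUG is set
}
```

    With a real file, the same functions would take a std::ifstream opened with std::ios::binary, and dataSizeInBytes would then say how many bytes to read next.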

    Let's consider the two binary files once we are past all the header information and are reading in the bytes to parse. We get to the size of the data in bytes: for FileA we have 4 floats = 16 bytes and for FileB we have 2 doubles = 16 bytes. So now we know how to call the method to read in x amount of data for a y type of data. Since y is a type and x is an amount, we can write this as y(x), as if y were a built-in type (int, float, double, char, etc.) and x a numerical initializer passed to its constructor.

    Now let's say we were reading in either one of these two files but didn't know the data structure and how its information was previously stored to the file and we are seeing by the header that the data size is 16 bytes in memory but we didn't know if it was being stored as either 4 floats = 16 bytes or 2 doubles = 16 bytes. Both structures are 16 bytes but have a different amount of different data types.
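
    To make the ambiguity concrete, the same 16 raw bytes can be copied into either structure. This is a sketch under the assumption of 4-byte floats and 8-byte doubles; bytesAsA and bytesAsB are hypothetical helper names, and DataA and DataB mirror the structs shown earlier.

```cpp
#include <cassert>
#include <cstring>

struct DataA { float width; float length; float height; float dummy; };
struct DataB { double length; double width; };

static_assert(sizeof(DataA) == 16 && sizeof(DataB) == 16,
              "this sketch assumes 4-byte float and 8-byte double");

// Interprets 16 raw bytes as four floats.
DataA bytesAsA(const unsigned char* bytes) {
    assert(bytes != nullptr);
    DataA a;
    std::memcpy(&a, bytes, sizeof(DataA));
    return a;
}

// Interprets the same 16 raw bytes as two doubles.
DataB bytesAsB(const unsigned char* bytes) {
    assert(bytes != nullptr);
    DataB b;
    std::memcpy(&b, bytes, sizeof(DataB));
    return b;
}
```

    If the bytes were written from a DataA holding 3.0f values, bytesAsA recovers them, while bytesAsB yields two doubles that have nothing to do with 3.0. Nothing in the bytes themselves says which reading is the intended one.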

    In summary, without knowing the file's data structure, working out how to parse the binary becomes an X/Y problem.

    Now let's assume that you do know the file structure. To work towards an answer to your question, you can try this little program and check out some results:

    #include <string>
    #include <iostream>
    
    int main() {
    
        // Using Two Strings
        std::string imagesPath("ImagesPath\\");
        std::string labelsPath("LabelsPath\\");
    
        // Concat of Two Strings
        std::string full = imagesPath + labelsPath;
    
        // Display Of Both
        std::cout << full << std::endl;
    
        // Data Type Pointers 
        char* cData = nullptr;
        cData = new char[12];
    
        unsigned char* ucData = nullptr;
        ucData = new unsigned char[12];
    
        // Loop To Set Both Pointers To The String
        unsigned n = 0;
        for (; n < 12; ++n) {
            cData[n] = full.at(n);
            ucData[n] = full.at(n);
        }
    
        // Display Of Both Strings By Character and Unsigned Character
        n = 0;
        for (; n < 12; ++n) {
            std::cout << cData[n];
        }
        std::cout << std::endl;
    
        n = 0;
        for (; n < 12; ++n) {
            std::cout << ucData[n];
        }
        std::cout << std::endl;
        // Both Yield The Same Result
        // Okay, let's clear out the memory of these pointers and then reuse them.
    
        delete[] cData;
        delete[] ucData;
        cData = nullptr;
        ucData = nullptr;
    
        // Create Two Data Structures, 1 For Each Different File
        struct A {
            float length;
            float width;
            float height;
            float padding;
        };
    
        struct B {
            double length;
            double width;
        };
    
        // Constants For Our Data Structure Sizes
        const unsigned sizeOfA = sizeof(A);
        const unsigned sizeOfB = sizeof(B);
    
        // Create And Populate An Instance Of Each
        A a;
        a.length = 3.0f;
        a.width = 3.0f;
        a.height = 3.0f;
        a.padding = 0.0f;
    
        B b;
        b.length = 5.0;
        b.width = 5.0;
    
        // Let's first use the `char[]` method for each struct and print them,
        // but we need 16 bytes instead of the `12` from your problem.
        char *aData = nullptr;  // FileA
        char *bData = nullptr;  // FileB
    
        aData = new char[16];
        bData = new char[16];
    
        // Copy the raw bytes of A into the buffer. Assigning a float to a
        // single char would only store its truncated value, not its bytes.
        const char *aBytes = reinterpret_cast<const char*>(&a);
        n = 0;
        for (; n < sizeOfA; ++n) {
            aData[n] = aBytes[n];
        }
    
        // Print the individual bytes of A as numbers, without reassembling
        // them, and compare the pattern shown on the screen between A & B.
        n = 0;
        for (; n < 16; ++n) {
            std::cout << static_cast<int>(static_cast<unsigned char>(aData[n])) << " ";
        }
        std::cout << std::endl;
    
        // Copy the raw bytes of B into its buffer in the same way.
        const char *bBytes = reinterpret_cast<const char*>(&b);
        n = 0;
        for (; n < sizeOfB; ++n) {
            bData[n] = bBytes[n];
        }
    
        // Print the individual bytes of B as numbers.
        n = 0;
        for (; n < 16; ++n) {
            std::cout << static_cast<int>(static_cast<unsigned char>(bData[n])) << " ";
        }
        std::cout << std::endl;
    
        // Let's print out both again, but reassembled as their appropriate
        // types: copy the bytes back into a struct, then print its members.
        A aCopy;
        char *aCopyBytes = reinterpret_cast<char*>(&aCopy);
        n = 0;
        for (; n < sizeOfA; ++n) {
            aCopyBytes[n] = aData[n];
        }
        std::cout << aCopy.length << " " << aCopy.width << " "
                  << aCopy.height << " " << aCopy.padding << std::endl;
    
        B bCopy;
        char *bCopyBytes = reinterpret_cast<char*>(&bCopy);
        n = 0;
        for (; n < sizeOfB; ++n) {
            bCopyBytes[n] = bData[n];
        }
        std::cout << bCopy.length << " " << bCopy.width << std::endl;
    
        // Clean Up Memory
        delete[] aData;
        delete[] bData;
        aData = nullptr;
        bData = nullptr;
    
        // Even by knowing the appropriate sizes we can see a difference
        // in the stored data types. We can now do the same as above
        // but with unsigned char & see if it makes a difference.
    
        unsigned char *ucAData = nullptr;
        unsigned char *ucBData = nullptr;
    
        ucAData = new unsigned char[16];
        ucBData = new unsigned char[16];
    
        // Copy the raw bytes of A into the buffer. Assigning a float to a
        // single unsigned char would only store its truncated value.
        const unsigned char *ucASource = reinterpret_cast<const unsigned char*>(&a);
        n = 0;
        for (; n < sizeOfA; ++n) {
            ucAData[n] = ucASource[n];
        }
    
        // Print the individual bytes of A as numbers, without reassembling
        // them, and compare the pattern shown on the screen between A & B.
        n = 0;
        for (; n < 16; ++n) {
            std::cout << static_cast<int>(ucAData[n]) << " ";
        }
        std::cout << std::endl;
    
        // Copy the raw bytes of B into its buffer in the same way.
        const unsigned char *ucBSource = reinterpret_cast<const unsigned char*>(&b);
        n = 0;
        for (; n < sizeOfB; ++n) {
            ucBData[n] = ucBSource[n];
        }
    
        // Print the individual bytes of B as numbers.
        n = 0;
        for (; n < 16; ++n) {
            std::cout << static_cast<int>(ucBData[n]) << " ";
        }
        std::cout << std::endl;
    
        // Let's print out both again, but reassembled as their appropriate
        // types: copy the bytes back into a struct, then print its members.
        A ucACopy;
        unsigned char *ucACopyBytes = reinterpret_cast<unsigned char*>(&ucACopy);
        n = 0;
        for (; n < sizeOfA; ++n) {
            ucACopyBytes[n] = ucAData[n];
        }
        std::cout << ucACopy.length << " " << ucACopy.width << " "
                  << ucACopy.height << " " << ucACopy.padding << std::endl;
    
        B ucBCopy;
        unsigned char *ucBCopyBytes = reinterpret_cast<unsigned char*>(&ucBCopy);
        n = 0;
        for (; n < sizeOfB; ++n) {
            ucBCopyBytes[n] = ucBData[n];
        }
        std::cout << ucBCopy.length << " " << ucBCopy.width << std::endl;
    
        // Clean Up Memory
        delete[] ucAData;
        delete[] ucBData;
        ucAData = nullptr;
        ucBData = nullptr;
    
        // So changing from `char` to `unsigned char` makes no difference here, even
        // with reinterpret casting, because these 2 files are different from one another.
        // Each has a unique signature. A family of files that a specific application
        // saves as binaries will all follow the same structure. The key point is
        // `what type` of data you are reading in, and how much of it. Without knowing
        // the structure of the binary file, this becomes an (X/Y) problem.
        // This is the hard part about parsing binaries: you need to know the file structure.
    
        char c = ' ';
        std::cin.get(c);
    
        return 0;
    }
    

    After running the short program above, don't worry about what each value displayed on the screen is; just look at the patterns, and compare them between the two different file structures. This is just to show that a struct of floats that is 16 bytes wide is not the same as a struct of doubles that is also 16 bytes wide. So when we go back to your problem, where you are reading in 12 individual consecutive bytes, the question becomes: what do these first 12 bytes represent? Are they 3 ints or 3 unsigned ints (4 bytes each on most platforms), or 3 floats, or a combination such as 1 double and 1 float? What is the data structure of the binary file you are reading in?
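
    As one possible reading, and only as an assumption that the question does not confirm, suppose the file stores 32-bit unsigned integers with the most significant byte first (big-endian). A minimal sketch with a hypothetical helper named readBigEndian32 that reassembles such a value from the char buffer:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: combines the 4 bytes starting at `offset` into one
// 32-bit value, most significant byte first (big-endian). Whether the file
// really uses this layout is an assumption.
std::uint32_t readBigEndian32(const char* data, std::size_t offset) {
    assert(data != nullptr);
    std::uint32_t value = 0;
    for (std::size_t i = 0; i < 4; ++i) {
        // Go through unsigned char so that bytes of 0x80 and above are not
        // sign-extended before being merged into the result.
        value = (value << 8) | static_cast<unsigned char>(data[offset + i]);
    }
    return value;
}
```

    Under that assumption, the first 12 bytes quoted in the question (00 00 08 03 00 00 EA 60 00 00 00 1C) would decode at offsets 0, 4 and 8 to 2051, 60000 and 28. Whether those numbers are meaningful is exactly what the file's documented structure has to tell you.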

    Edit: In my little program I forgot to add << std::hex << to the print statements. It can be added wherever the buffer entries are printed, but it is not needed: the program only illustrates visually how the two data structures differ in memory and what their patterns look like.