c++ file parsing binary fread

C++ Read binary file into uint8 array to return decimal int gives wrong result


I am trying to parse a binary file and extract different data structures from it. One of them can be a uint8 or int8 (also uint16, int16, and so on up to 64 bits).

To keep the method as general as possible, I read the data from the given file pointer and store it in a uint8 array (buffer).

In my test, I assumed that a file content of 40 (in hex) should produce the integer 64. That is why my test asserts these values, to be sure about it. **Unfortunately, the uint8 array's content always results in a decimal int of 52.** I don't know why, and I have tried various other ways to read a specific number of bytes and assign them to an integer variable. Is this an endianness issue or something else?

Thanks in advance, if someone can help :)

My read_int method:

int read_int(FILE * file,int n,bool is_signed) throw(){
  assert(n>0);
  uint8_t n_chars[n];
  int result;
  for (int i = 0; i < n; i++)
  {
    if(fread(&n_chars[i],sizeof(n_chars[i]),1,file)!=1){
        std::cerr<< "fread() failed!\n";
        throw new ReadOpFailed();
    }
    result*=255;
    result+=n_chars[i];
  }
    std::cout<< "int read: "<<result<<"\n";
    return result;

//-------------Some ideas that didn't work out either------------------
    // std::stringstream ss;
    // ss << std::hex << static_cast<int>(static_cast<unsigned char>(n_chars)); // Convert byte to hexadecimal string
    // int result;
    // ss >> result; // Parse the hexadecimal string to integer
    // std::cout << "result" << result<<"\n";

One small test that fails badly... The endianness detection prints the little-endian message (I don't know whether this is part of the problem at all).

struct TestContext{
    FILE * create_test_file_hex(char * input_hex,const char * rel_file_path = "test.gguf") {
        std::ofstream MyFile(rel_file_path, std::ios::binary);

        // Write to the file
        MyFile << input_hex;

        // Close the file
        MyFile.close();

        
        // std::fstream outfile (rel_file_path,std::ios::trunc);
        // char str[20] = 
        // outfile.write(str, 20);
        // outfile.close();

        FILE *file = fopen(rel_file_path,"rb");
        try{
            assert(file != nullptr);
        }catch (int e){
            std::cout << "file couldn't be opened due to exception n° "<<std::to_string(e)<<"\n";
            ADD_FAILURE(); 
        }
        std::remove(rel_file_path); //remove file whilst open, to be able to use it, but delete it after the last pointer was deleted.
    return file;
    }
};

TEST(test_tool_functions, test_read_int){
    int n = 1;
    // little endian if true
    if(*(char *)&n == 1) {std::cout<<"Little Endian Detected!!!\n";}
    else{std::cout<<"Big Endian Detected!!!\n";}
    std::string file_hex_content = "400A0E00080000016";
    
    uint64_t should;
    std::istringstream("40") >> std::hex >> should;
    ASSERT_EQ(should,64);
    
    uint64_t result = read_int(TestContext().create_test_file_hex(file_hex_content.data()),1,false);
    ASSERT_EQ(result,should);
}

Solution

  • The root cause of the problem is that your file_hex_content consists of ASCII character bytes (which form a human-readable hexadecimal string representation of a number), not of the bytes that would form a binary integer representation. Therefore it doesn’t start with a single byte 0x40 a.k.a. 64 but with a byte '4' (ASCII byte value 52) followed by another byte '0' (ASCII value 48). A single byte 64 (0x40) corresponds to the ASCII character '@' rather than two characters '4' and '0'.

    A small serialization example follows. As long as you serialize and deserialize on the same architecture and have no portability concerns, endianness is not a concern either.

    #include <cstdint>
    #include <ios>
    #include <iostream>
    #include <sstream>
    
    int main() {
      std::stringstream encoded;
    
      const uint64_t source{0xabcd1234deadbeefULL};
      encoded.write(reinterpret_cast<const char*>(&source), sizeof(source));
    
      uint64_t target;
      encoded.read(reinterpret_cast<char*>(&target), sizeof(target));
    
      std::cout << "source == target: " << std::hex << source << " == " << target
                << "\nserialized bytes:";
      for (const uint8_t byte : encoded.str())
        std::cout << ' ' << static_cast<uint32_t>(byte);
      std::cout << std::endl;
    }
    

    The output from the program above, when executed on my little endian machine, looks like this:

    source == target: abcd1234deadbeef == abcd1234deadbeef
    serialized bytes: ef be ad de 34 12 cd ab
    

    As expected, the serialized string starts from the lowest order byte 0xef and ends with the highest order byte 0xab. On a big endian platform, the second line would be ordered from highest to lowest order byte, i.e. ab cd 12 34 de ad be ef.