
Reading binary file in C++ and output result as hexdump


I'm building a simpler version of xxd for a school project and I'm getting hung up on the file output when reading binary files only (i.e. when I read plain text files, everything works as expected).

Expected output:

0000000: 504b 0304 1400 0000 0800 70b6 4746 562d  PK........p.GFV-
0000010: e841 3600 0000 3f00 0000 0900 1c00 706c  .A6...?.......pl
0000020: 6169 6e2e 7478 7455 5409 0003 7307 d754  ain.txtUT...s..T
0000030: ba1d d754 7578 0b00 0104 f501 0000 0414  ...Tux..........
0000040: 0000 000b c9c8 2c56 00a2 e2fc dc54 85e2  ......,V.....T..
0000050: c4dc 829c 5485 92d4 8a12 ae10 a844 625e  ....T........Db^
0000060: 7e49 466a 9142 4e66 5eaa 4266 9e02 9003  ~IFj.BNf^.Bf....
0000070: 56a0 9096 9993 ca05 0050 4b01 021e 0314  V........PK.....
0000080: 0000 0008 0070 b647 4656 2de8 4136 0000  .....p.GFV-.A6..
0000090: 003f 0000 0009 0018 0000 0000 0001 0000  .?..............
00000a0: 00a4 8100 0000 0070 6c61 696e 2e74 7874  .......plain.txt
00000b0: 5554 0500 0373 07d7 5475 780b 0001 04f5  UT...s..Tux.....
00000c0: 0100 0004 1400 0000 504b 0506 0000 0000  ........PK......
00000d0: 0100 0100 4f00 0000 7900 0000 0000       ....O...y.....

Actual output:

0000000: 504b 0304 1400 0000 0800 70ffb6 4746 562d  PK........p.GFV-
0000010: ffe841 3600 0000 3f00 0000 0900 1c00 706c  .A6...?.......pl
0000020: 6169 6e2e 7478 7455 5409 0003 7307 ffd754  ain.txtUT...s..T
0000030: ffba1d ffd754 7578 0b00 0104 fff501 0000 0414  ...Tux..........
0000040: 0000 000b ffc9ffc8 2c56 00ffa2 ffe2fffc ffdc54 ff85ffe2  ......,V.....T..
0000050: ffc4ffdc ff82ff9c 54ff85 ff92ffd4 ff8a12 ffae10 ffa844 625e  ....T........Db^
0000060: 7e49 466a ff9142 4e66 5effaa 4266 ff9e02 ff9003  ~IFj.BNf^.Bf....
0000070: 56ffa0 ff90ff96 ff99ff93 ffca05 0050 4b01 021e 0314  V........PK.....
0000080: 0000 0008 0070 ffb647 4656 2dffe8 4136 0000  .....p.GFV-.A6..
0000090: 003f 0000 0009 0018 0000 0000 0001 0000  .?..............
00000a0: 00ffa4 ff8100 0000 0070 6c61 696e 2e74 7874  .......plain.txt
00000b0: 5554 0500 0373 07ffd7 5475 780b 0001 04fff5  UT...s..Tux.....
00000c0: 0100 0004 1400 0000 504b 0506 0000 0000  ........PK......
00000d0: 0100 0100 4f00 0000 7900 0000 0000 0000  ....O...y.......

Here's a quick diff of the two files for easy reference.

I have a feeling it's the way I'm reading the files. I decided to stick with the C++ libraries and use std::ifstream to read the files. Here's my implementation:

void DumpUtility::dump(const char* filename) {
    std::ifstream file(filename, std::ifstream::in|std::ifstream::binary); // open file for reading

    if(file.is_open()) { // ensure file is open and ready to go
        std::cout << std::hex << std::setfill('0'); // pad PC with leading zeros
        char buffer[this->bytesPerLine]; // buffer symbols

        while(file.good()) {
            file.read(buffer, this->bytesPerLine);
            std::cout << std::setw(7) << this->pc << ":";
            for(unsigned int i = 0; i < this->bytesPerLine; i++) {
                if(i % 2 == 0) std::cout << " ";

                std::cout << std::setw(2) << (unsigned short)buffer[i];
            }

            std::cout << "  ";
            for(unsigned int i = 0; i < this->bytesPerLine; i++) {
                if(isprint(buffer[i]) == 0) { // checks if character is printable
                    std::cout << ".";
                } else {
                    std::cout << buffer[i];
                }
            }
            std::cout << std::endl;
            this->pc += this->bytesPerLine;
        }
    } else {
        std::cerr << "Couldn't open file. General error..." << std::endl;
        exit(EXIT_FAILURE);
    }

    file.close();
}

So, file.read(buffer, this->bytesPerLine); is the line that reads the file, and I format the data as hex via iomanip. I have also tried using printf("%02X", (unsigned short)buffer[i]); with no luck; the output is the same.

What has been done

  • Using multiple compilers to recompile program
    • clang++ -O0 -g -Wall -c
    • g++ -g -Wall -c
  • Two versions of g++
    • 4.2.1 - OS X 10.10
    • 3.4.6 - Sun Solaris 10
  • Debugging in lldb and gdb to see exactly where these extra F's come from; I found nothing.

It seems like std::ifstream::read() is doing something other than simply storing the special characters as they are. Does anyone know what these extra F's represent and can anyone point me in the right direction to resolve this?

Note: I'm trying to understand how to do this using std::ifstream as opposed to using cstdio. If worst comes to worst then I'll implement the method using the File IO utilities in cstdio instead. If I can't do this using ifstream then I'll gladly take the explanation so I can learn!


Solution

  • The problem is that char is a signed type on your platform. When you write (unsigned short)buffer[i], the conversion effectively goes char -> int -> unsigned short.

    If the byte is 0x7f or below, it is non-negative and all is fine. Otherwise it is sign-extended, that is, padded with 1 bits to form a negative int. Your first problem is at the byte b6. What actually happens is:

    b6 (signed char) -> ffffffb6 (signed int) -> ffb6 (unsigned short)
    

    Fortunately the fix is simple. You just have to write:

    std::cout << std::setw(2) << (unsigned short) (unsigned char) buffer[i];
    

    because the conversion will now correctly be:

    b6 (signed char) -> b6 (unsigned char) -> b6 (signed int) -> b6 (unsigned short)
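
To see the two conversion chains side by side, here is a minimal sketch. The helper names hexByte and hexByteSignExtended are hypothetical and are not part of the DumpUtility class from the question. The second function takes signed char explicitly so that the result does not depend on whether plain char happens to be signed on a given compiler, and the sketch assumes a 16-bit unsigned short, which is the common case.

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats one byte as two lowercase hex digits.
std::string hexByte(char c) {
    std::ostringstream out;
    // Convert to unsigned char first so values above 0x7f are not
    // sign-extended, then widen to an integer type so the stream prints
    // a number rather than a character.
    out << std::hex << std::setfill('0') << std::setw(2)
        << static_cast<unsigned int>(static_cast<unsigned char>(c));
    return out.str();
}

// The cast from the question, kept only for comparison. A negative value
// is sign-extended before it is truncated to unsigned short.
std::string hexByteSignExtended(signed char c) {
    std::ostringstream out;
    out << std::hex << std::setfill('0') << std::setw(2)
        << static_cast<unsigned short>(c);
    return out.str();
}
```

With these definitions, hexByte applied to the byte b6 yields "b6", while hexByteSignExtended applied to the same byte (the value -74 as a signed char) yields "ffb6", which matches the stray ff pairs in the actual output. Bytes at or below 0x7f, such as 'P' (0x50), come out the same from both functions.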