Tags: c++, linux, c++11, terminal, terminfo

Read a little-endian 16-bit unsigned integer


I'm looking into parsing terminfo database files, which are a type of binary file. You can read about the storage format on your own and confirm the problem I'm facing.

The manual says:

The header section begins the file. This section contains six short integers in the format described below. These integers are

(1) the magic number (octal 0432);

...

...

Short integers are stored in two 8-bit bytes. The first byte contains the least significant 8 bits of the value, and the second byte contains the most significant 8 bits. (Thus, the value represented is 256*second+first.) The value -1 is represented by the two bytes 0377, 0377; other negative values are illegal. This value generally means that the corresponding capability is missing from this terminal. Machines where this does not correspond to the hardware must read the integers as two bytes and compute the little-endian value.


  • The first problem while parsing this type of input is that the format fixes the size to 8 bits, so plain old char cannot be used, since it doesn't guarantee a size of exactly 8 bits. So I was looking at the fixed-width integer types, but then I faced the dilemma of choosing between int8_t and uint8_t, which are clearly stated to be "provided only if the implementation directly supports the type". So what should I choose so that the type is portable enough?

  • The second problem is that there is no buffer.readInt16LE() method in the C++ standard library to read 16 bits of data in little-endian format. So how should I proceed to implement this function in a portable and safe way?

I've already tried reading it with the char data type, but it definitely produces garbage on my machine. The proper output can be seen with the infocmp command, e.g. $ infocmp xterm.


#include <fstream>
#include <iostream>
#include <vector>

int main()
{
    // Open at the end so tellg() reports the file size.
    std::ifstream db(
      "/usr/share/terminfo/g/gnome", std::ios::binary | std::ios::ate);

    std::vector<unsigned char> buffer;

    if (db) {
        auto size = db.tellg();
        buffer.resize(size);
        db.seekg(0, std::ios::beg);
        db.read(reinterpret_cast<char*>(buffer.data()), size);
    }
    std::cout << "\n";
}

$1 = std::vector of length 3069, capacity 3069 = {26 '\032', 1 '\001', 21 '\025',
  0 '\000', 38 '&', 0 '\000', 16 '\020', 0 '\000', 157 '\235', 1 '\001',
  193 '\301', 4 '\004', 103 'g', 110 'n', 111 'o', 109 'm', 101 'e', 124 '|',
  71 'G', 78 'N', 79 'O', 77 'M', 69 'E', 32 ' ', 84 'T', 101 'e', 114 'r',
  109 'm', 105 'i', 110 'n', 97 'a', 108 'l', 0 '\000', 0 '\000', 1 '\001',
  0 '\000', 0 '\000', 1 '\001', 0 '\000', 0 '\000', 0 '\000', 0 '\000',
  0 '\000', 0 '\000', 0 '\000', 0 '\000', 1 '\001', 1 '\001', 0 '\000',
....
....

Solution

  • The first problem while parsing this type of input is that the format fixes the size to 8 bits, so plain old char cannot be used, since it doesn't guarantee a size of exactly 8 bits.

    Any integer that is at least 8 bits is OK. While char isn't guaranteed to be exactly 8 bits, it is required to be at least 8 bits, so as far as size is concerned there is no problem, other than that you may in some cases need to mask the high bits if they exist. However, char might not be unsigned, and you don't want the octets to be interpreted as signed values, so use unsigned char instead.
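    To see why a possibly-signed plain char is dangerous here, compare combining the two octets through signed char versus unsigned char. This is a small sketch assuming a two's-complement representation (universal in practice, and guaranteed since C++20):

    ```cpp
    #include <iostream>

    int main()
    {
        // Two octets of a little-endian 16-bit value: 0x019D == 413.
        unsigned char bytes[2] = {0x9D, 0x01};

        // Correct: unsigned char promotes to a non-negative int.
        int right = bytes[0] | (bytes[1] << 8);

        // Wrong: a signed char holding 0x9D sign-extends to 0xFFFFFF9D
        // when promoted to int, corrupting the high bits of the result.
        signed char sc0 = static_cast<signed char>(bytes[0]);
        signed char sc1 = static_cast<signed char>(bytes[1]);
        int wrong = sc0 | (sc1 << 8);

        std::cout << right << " vs " << wrong << "\n"; // 413 vs -99
    }
    ```

    This sign extension is the classic source of the "garbage" values described in the question.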

    The second problem is that there is no buffer.readInt16LE() method in the C++ standard library to read 16 bits of data in little-endian format. So how should I proceed to implement this function in a portable and safe way?

    Read one octet at a time into an unsigned char. Assign the first octet to a variable that is large enough to represent at least 16 bits. Shift the second octet left by 8 bits and combine it with the variable using the compound bitwise OR.
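    Following those steps, a minimal sketch of such a helper might look like this (read_u16_le is a hypothetical name, not a standard function):

    ```cpp
    #include <cstdint>
    #include <iostream>

    // Combine two octets, least significant first, into a 16-bit
    // unsigned value; works regardless of the host's byte order.
    std::uint16_t read_u16_le(const unsigned char* p)
    {
        std::uint16_t value = p[0];                      // low octet
        value |= static_cast<std::uint16_t>(p[1]) << 8;  // high octet
        return value;
    }

    int main()
    {
        // First two bytes of a terminfo file: the magic number 0432 (octal).
        unsigned char magic[2] = {0x1A, 0x01};
        std::cout << std::oct << read_u16_le(magic) << "\n"; // prints 432
    }
    ```

    A missing capability (the octets 0377, 0377) then comes back as 0xFFFF, which the caller can translate to -1 or "absent" as the manual describes.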

    Or better yet, don't re-implement it; use an existing third-party library.

    I've already tried reading it with the char data type, but it definitely produces garbage on my machine.

    Then your attempt was buggy. There is no problem inherent in char that would cause garbage output. I recommend using a debugger to solve this problem.