Search code examples
phpsortingbinarycompressionfile-format

How to store sorted set of integers in a binary file?


Large sets of 32-bit integer values, sorted ascending, only unique values, are stored for sequential reading. Original files are large, cannot fit into RAM, are read block by block. So far I just save them as binary files, 4 bytes per value.

As system grows there's a need to optimize storage/backup space, and I find there's a great potential for efficient compression of such data considering its sorted.

What I came up with is to store initial value and then the increments, which tend to be smaller values as the number of values in the set grows. Rounding their size to full bytes, I suggest to leave only meaningful bytes per increment, so instead of 4 there can be 1-2-3 bytes per increment. And to signal the number of bytes used, I'd use header 2 bits per increment.

The stream would look like:

01010101 01010101 01010101 01010101 - initial value

                                      Four increments block start
10110101                            - b bytes used: four 2-bit pairs
                                      (10 11 01 01 = 2, 3, 1, 1)
01010101 01010101                   - inc
01010101 01010101 01010101          - inc
01010101                            - inc
01010101                            - inc

                                      Four increments block start
11011101                            - b (11 01 11 01 = 3, 1, 3, 1)
01010101 01010101 01010101          - inc
01010101                            - inc
01010101 01010101 01010101          - inc
01010101                            - inc

...

Am I trying to invent a wheel here? Can stream compression be more efficient here, remaining operational on rather small blocks?


Solution

  • This is called variable-length integers or variable-length quantities. A possibly more efficient approach for your application depending on the distribution of your differences, and faster as well avoiding dealing with a stream of bits, is to code the end of the integer in the high bit of each byte, using the remaining seven bits of each byte for the integer bits. So for example, you could have a high bit of 1 indicate the end of the integer, with the integer bits stored most-significant bits first. Then 0..127 is stored as the bytes 0x80..0xff. If the first byte is 0x01..0x7f, then you have the high seven bits of the number, and you go to the next bytes for the next seven bits, until you get a high bit of one.

    This also has the property that an initial byte of 0x00 is not allowed, so you can use that to indicate the end of the stream.

    Your scheme might be more efficient if, for example, differences in the range of 128..255 are very common. In that case, your scheme uses ten bits for those, where the one described above uses 16. On the other hand if differences in the range 0..127 are the most common, the above may be best since it codes those as eight bits, where yours uses ten.

    Having done the coding, you may get further compression by applying a standard compression routine, such as zlib.