Interpreting bit-packed data in Python with structs and ctypes: misaligned results


I have some data stored in a bit-packed format, which I'm trying to extract with Python. Specifically these are old bitmap fonts, where:

  • each line of a glyph is a group of n bits (where n is the character width).
  • only the first line is guaranteed to start on a byte boundary.
  • the bitmap data for each glyph is packed into a big-endian series of bytes.

For example, here's the big-endian hex data for one glyph, which is 9 bits wide:

1F 1F DB 79 3D BF FE F7 86 CE 3E

If we break this up manually, we can see that this bit pattern represents a little smiley face:

8-bit bytes:        9-bit chunks:   character (0='.', 1='$'):
                                         
0x1F = 00011111     000111110       ...$$$$$. 
0x1F = 00011111     001111111       ..$$$$$$$ 
0xDB = 11011011     011011011       .$$.$$.$$ 
0x79 = 01111001     110010011       $$..$..$$ 
0x3D = 00111101     110110111       $$.$$.$$$ 
0xBF = 10111111     111111111       $$$$$$$$$ 
0xFE = 11111110     101111011       $.$$$$.$$ 
0xF7 = 11110111     110000110       $$....$$. 
0x86 = 10000110     110011100       $$..$$$.. 
0xCE = 11001110     011111000       .$$$$$... 
0x3E = 00111110                               

So I tried to do the same in Python, but I can't get the data to align properly when packed.

Here's what I have so far - just some test code to see if I can get the expected result from this particular piece of data. Note the choice of field_type, since that seems to be the root of my problems:

import struct
import ctypes

field_type = ctypes.c_ulonglong

class PackedBitmap(ctypes.BigEndianStructure):
    _fields_ = [ ('line00', field_type, 9),
                 ('line01', field_type, 9),
                 ('line02', field_type, 9),
                 ('line03', field_type, 9),
                 ('line04', field_type, 9),
                 ('line05', field_type, 9),
                 ('line06', field_type, 9),
                 ('line07', field_type, 9),
                 ('line08', field_type, 9),
                 ('line09', field_type, 9) ]

bm = PackedBitmap()

struct.pack_into('>11s', bm, 0, 
                 b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E')

for field in bm._fields_:
    bin_str = f'{getattr(bm, field[0]):09b}'
    print(bin_str + '     ' + bin_str.replace('0','.').replace('1','$'))

But no matter which C type I pick for my PackedBitmap's fields, I can't get the output right. There are always errors, and the size of the data type seems to determine where the first error will occur.

field_type = ctypes.c_ulonglong:

000111110     ...$$$$$.
001111111     ..$$$$$$$
011011011     .$$.$$.$$
110010011     $$..$..$$
110110111     $$.$$.$$$
111111111     $$$$$$$$$
101111011     $.$$$$.$$
100001101     $....$$.$   # <- first error
100111000     $..$$$...
111110000     $$$$$....

field_type = ctypes.c_uint32:

000111110     ...$$$$$.
001111111     ..$$$$$$$
011011011     .$$.$$.$$
001111011     ..$$$$.$$   # <- first error
011111111     .$$$$$$$$
111110111     $$$$$.$$$
100001101     $....$$.$
100111000     $..$$$...
111110000     $$$$$....
000000000     .........

field_type = ctypes.c_uint16:

000111110     ...$$$$$.
110110110     $$.$$.$$.   # <- first error
001111011     ..$$$$.$$
111111101     $$$$$$$.$
100001101     $....$$.$
001111100     ..$$$$$..
000000000     .........
000000000     .........
000000000     .........
000000000     .........

I'm not sure what's going on here: a length of 9 bits fits comfortably within all these field types (64, 32, and 16 bits respectively), so shouldn't this be packed as expected? What am I missing, and how to fix this?


Solution

  • It looks like ctypes.Structure has problems when a bit-field would cross the boundary of its underlying field type (64, 32, or 16 bits here), not just a byte boundary.

    We can print the offsets the fields use:

    print(PackedBitmap.line00)
    ...
    print(PackedBitmap.line09)
    

    Which gives us

    <Field type=c_ulonglong_be, ofs=0:55, bits=9> # 1
    <Field type=c_ulonglong_be, ofs=0:46, bits=9> # 2
    <Field type=c_ulonglong_be, ofs=0:37, bits=9> # 3
    <Field type=c_ulonglong_be, ofs=0:28, bits=9> # 4
    <Field type=c_ulonglong_be, ofs=0:19, bits=9> # 5
    <Field type=c_ulonglong_be, ofs=0:10, bits=9> # 6
    <Field type=c_ulonglong_be, ofs=0:1, bits=9>  # 7
    <Field type=c_ulonglong_be, ofs=8:55, bits=9> # 8
    <Field type=c_ulonglong_be, ofs=8:46, bits=9> # 9
    <Field type=c_ulonglong_be, ofs=8:37, bits=9> # 10
    
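    • The first field occupies bits 55 to 63 of the first 64-bit unit (byte offset 0). This is fine.
    • The seventh occupies bits 1 to 9 of the same unit. This is also fine.
    • The eighth occupies bits 55 to 63 of the second 64-bit unit (byte offset 8), so bit 0 of the first unit is skipped and every later field is shifted by one bit.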

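    This is at least unexpected, and may be a bug, although it resembles how C compilers commonly lay out bit-fields: a field is not split across two units of its declared type. When you switch to other data types you will see that the first error appears at the first field that would not fit into the remainder of the current 32-bit or 16-bit unit.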

    It could be worth reporting this to the CPython devs. In my opinion this should at least be mentioned in the documentation. I could not find such a hint.

    However, here is an implementation that uses neither ctypes nor struct:

    1. It concatenates the bytes, written as 1s and 0s, into one large string.
    2. It then splits that string into parts of length 9.

    # itertools.batched requires Python 3.12 or newer
    from collections.abc import Iterator
    from itertools import batched
    
    raw = b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E'
    
    
    def regroup_bits(buffer: bytes, bits: int) -> Iterator[str]:
        binary = ''.join(f'{byte:08b}' for byte in buffer)
    
        return (''.join(batch) for batch in batched(binary, bits))
    
    
    for i in regroup_bits(raw, 9):
        print(i.replace('0', '.').replace('1', '$'))
    

    This could be made more memory-friendly by not building one large string: accumulate only as many bits as you need, then yield each 9-bit value.

    Also note that this implementation does not pad the last value. For the 11-byte example (88 bits), the tenth chunk has only 7 bits.
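A minimal sketch of that memory-friendly variant follows. It keeps the pending bits in an integer accumulator instead of building a string, and it yields integers rather than strings. The name `iter_bit_groups` and the `pad_last` parameter are made up for this example; padding the final short group with zero bits on the right is an assumption, chosen because it matches the manual breakdown in the question.

```python
from collections.abc import Iterable, Iterator


def iter_bit_groups(buffer: Iterable[int], bits: int, pad_last: bool = True) -> Iterator[int]:
    """Yield consecutive `bits`-wide big-endian values from a stream of byte values."""
    mask = (1 << bits) - 1
    acc = 0     # pending bits, most recent byte at the low end
    filled = 0  # number of valid bits currently held in acc
    for byte in buffer:
        acc = (acc << 8) | byte
        filled += 8
        while filled >= bits:
            filled -= bits
            yield (acc >> filled) & mask
        acc &= (1 << filled) - 1  # drop the bits that were already yielded
    if filled and pad_last:
        # Assumption: left-align the leftover bits and fill the rest with zeros.
        yield (acc << (bits - filled)) & mask


raw = b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E'

for value in iter_bit_groups(raw, 9):
    print(f'{value:09b}'.replace('0', '.').replace('1', '$'))
```

With `pad_last=False` the trailing 7 bits are dropped instead, so the example data gives 9 rows rather than 10.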