I have some data stored in a bit-packed format, which I'm trying to extract with Python. Specifically these are old bitmap fonts, where each glyph is stored as consecutive rows of a fixed bit width, packed back-to-back with no byte alignment between rows.
For example, here's the big-endian hex data for one glyph, which is 9 bits wide:
1F 1F DB 79 3D BF FE F7 86 CE 3E
If we break this up manually, we can see that this bit pattern represents a little smiley face:
8-bit bytes: 9-bit chunks: character (0='.', 1='$'):
0x1F = 00011111 000111110 ...$$$$$.
0x1F = 00011111 001111111 ..$$$$$$$
0xDB = 11011011 011011011 .$$.$$.$$
0x79 = 01111001 110010011 $$..$..$$
0x3D = 00111101 110110111 $$.$$.$$$
0xBF = 10111111 111111111 $$$$$$$$$
0xFE = 11111110 101111011 $.$$$$.$$
0xF7 = 11110111 110000110 $$....$$.
0x86 = 10000110 110011100 $$..$$$..
0xCE = 11001110 011111000 .$$$$$...
0x3E = 00111110
So I tried to do the same in Python, but I can't get the data to align properly when packed.
Here's what I have so far - just some test code to see if I can get the expected result from this particular piece of data.
Note the choice of field_type, since that seems to be the root of my problems:
import struct
import ctypes

field_type = ctypes.c_ulonglong

class PackedBitmap(ctypes.BigEndianStructure):
    _fields_ = [ ('line00', field_type, 9),
                 ('line01', field_type, 9),
                 ('line02', field_type, 9),
                 ('line03', field_type, 9),
                 ('line04', field_type, 9),
                 ('line05', field_type, 9),
                 ('line06', field_type, 9),
                 ('line07', field_type, 9),
                 ('line08', field_type, 9),
                 ('line09', field_type, 9) ]

bm = PackedBitmap()
# copy the raw glyph bytes into the structure's buffer
struct.pack_into('>11s', bm, 0,
                 b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E')

for field in bm._fields_:
    bin_str = f'{getattr(bm, field[0]):09b}'
    print(bin_str + ' ' + bin_str.replace('0', '.').replace('1', '$'))
But no matter which C type I pick for my PackedBitmap's fields, I can't get the output right. There are always errors, and the size of the data type seems to determine where the first error will occur.
With field_type = ctypes.c_ulonglong:
000111110 ...$$$$$.
001111111 ..$$$$$$$
011011011 .$$.$$.$$
110010011 $$..$..$$
110110111 $$.$$.$$$
111111111 $$$$$$$$$
101111011 $.$$$$.$$
100001101 $....$$.$ # <- first error
100111000 $..$$$...
111110000 $$$$$....
With field_type = ctypes.c_uint32:
000111110 ...$$$$$.
001111111 ..$$$$$$$
011011011 .$$.$$.$$
001111011 ..$$$$.$$ # <- first error
011111111 .$$$$$$$$
111110111 $$$$$.$$$
100001101 $....$$.$
100111000 $..$$$...
111110000 $$$$$....
000000000 .........
With field_type = ctypes.c_uint16:
000111110 ...$$$$$.
110110110 $$.$$.$$. # <- first error
001111011 ..$$$$.$$
111111101 $$$$$$$.$
100001101 $....$$.$
001111100 ..$$$$$..
000000000 .........
000000000 .........
000000000 .........
000000000 .........
I'm not sure what's going on here: a width of 9 bits fits comfortably within all of these field types (64, 32, and 16 bits respectively), so shouldn't the data be unpacked as expected? What am I missing, and how do I fix it?
It looks like ctypes.Structure has problems when the bit fields would cross the storage-unit boundary of the chosen field type.
We can print the offsets of the fields:
print(PackedBitmap.line00)
...
print(PackedBitmap.line09)
Which gives us:
<Field type=c_ulonglong_be, ofs=0:55, bits=9> # 1
<Field type=c_ulonglong_be, ofs=0:46, bits=9> # 2
<Field type=c_ulonglong_be, ofs=0:37, bits=9> # 3
<Field type=c_ulonglong_be, ofs=0:28, bits=9> # 4
<Field type=c_ulonglong_be, ofs=0:19, bits=9> # 5
<Field type=c_ulonglong_be, ofs=0:10, bits=9> # 6
<Field type=c_ulonglong_be, ofs=0:1, bits=9> # 7
<Field type=c_ulonglong_be, ofs=8:55, bits=9> # 8
<Field type=c_ulonglong_be, ofs=8:46, bits=9> # 9
<Field type=c_ulonglong_be, ofs=8:37, bits=9> # 10
The first field uses bits 55 to 63 of the first 8-byte unit. This is fine. The seventh field uses bits 1 to 9. This is also fine. But the eighth field again uses bits 55 to 63, this time in the second 8-byte unit (ofs=8). So bit 0 of the first unit is skipped: the seven 9-bit fields only cover 63 of its 64 bits, and the leftover bit becomes padding instead of being shared with the next field. Everything from the eighth field onwards is therefore read one bit too late. This may be a bug, or at least unexpected. When you switch to the other data types you will see that the first error always appears where the fields spill over into the next storage unit of that type (after three fields for c_uint32, after a single field for c_uint16).
It could be worth reporting this to the CPython developers. In my opinion it should at least be mentioned in the documentation; I could not find any hint of this behaviour there.
However, here is an implementation that uses neither ctypes nor struct:
from collections.abc import Iterator
from itertools import batched  # itertools.batched needs Python 3.12+

raw = b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E'

def regroup_bits(buffer: bytes, bits: int) -> Iterator[str]:
    # render the whole buffer as one string of '0'/'1' characters,
    # then cut it into groups of `bits` characters
    binary = ''.join(f'{byte:08b}' for byte in buffer)
    return (''.join(batch) for batch in batched(binary, bits))

for i in regroup_bits(raw, 9):
    print(i.replace('0', '.').replace('1', '$'))
This could be optimised to be more memory-friendly by not building one large string, but instead collecting only as many bits as needed and then yielding each 9-bit value; see the sketch below. Also note that this implementation does not pad the last value (the trailing 7 bits of this glyph are printed as a shorter row).
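A rough sketch of that streaming variant (iter_bit_groups is just a name I made up, and I have only checked it against the example glyph) could look like this: keep an integer accumulator of pending bits and yield a value whenever at least 9 bits are available.

from collections.abc import Iterator

def iter_bit_groups(buffer: bytes, bits: int) -> Iterator[int]:
    acc = 0        # accumulated bits, most significant bit first
    acc_len = 0    # number of valid bits currently in acc
    for byte in buffer:
        acc = (acc << 8) | byte
        acc_len += 8
        while acc_len >= bits:
            acc_len -= bits
            yield acc >> acc_len           # the topmost `bits` bits
            acc &= (1 << acc_len) - 1      # keep only the leftover bits
    # any incomplete trailing group (7 bits for this glyph) is dropped here;
    # pad or yield it separately if you need that last row

raw = b'\x1F\x1F\xDB\x79\x3D\xBF\xFE\xF7\x86\xCE\x3E'
for value in iter_bit_groups(raw, 9):
    print(f'{value:09b}'.replace('0', '.').replace('1', '$'))

Unlike regroup_bits, this version yields integers rather than strings, and it never holds more than about two bytes' worth of bits in memory at a time.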