python comparison byte endianness radix

Comparison of byte literals in Python


The following question arose because I was trying to use byte strings as dictionary keys, and bytes values that I understood to be equal weren't being treated as equal.

Why don't the two sides of the following Python expression compare equal? Aren't they two equivalent representations of the same binary data (the example is deliberately chosen to avoid endianness issues)?

b'0b11111111' == b'0xff'

I know the following evaluates to True, demonstrating the equivalence:

int(b'0b11111111', 2) == int(b'0xff', 16)

But why does Python force me to know the representation? Is it related to endianness? Is there some easy way to force these to compare equivalent other than converting them all to, e.g., hexadecimal literals? Is there a transparent and clear method to move between all representations in a (somewhat) platform independent way (or am I asking too much)?

Say I want to index a dictionary using 8 bits in the form b'0b11111111': why does Python expand it to ten bytes, and how do I prevent that?

This is a small piece of a large tree data structure, and expanding my indexing by a factor of 10 (8 bits to 10 bytes) seems like a huge waste of memory.


Solution

  • Bytes can represent any number of things. Python cannot and will not guess at what your bytes might encode.

    For example, int(b'0b11111111', 34) is also a valid interpretation, but that interpretation is not equal to hex FF.

    The number of interpretations, in fact, is endless. The bytes could represent a series of ASCII codepoints, or image colors, or musical notes.

    Until you explicitly apply an interpretation, the bytes object is just a sequence of integer values in the range 0-255; its textual representation shows each byte as an ASCII character wherever that byte is printable:

    >>> list(bytes(b'0b11111111'))
    [48, 98, 49, 49, 49, 49, 49, 49, 49, 49]
    >>> list(bytes(b'0xff'))
    [48, 120, 102, 102]
    

    Those byte sequences are not equal.

    If you want to interpret these sequences explicitly as integer literals, then use ast.literal_eval() on the decoded text values; always normalise before comparing:

    >>> import ast
    >>> ast.literal_eval(b'0b11111111'.decode('utf8'))
    255
    >>> ast.literal_eval(b'0xff'.decode('utf8'))
    255