python binary compression theory lossless-compression

What encoding system/mechanism is used by Python's binary read?


I need to read a file in binary format, which I do like this:

with open("tex.pdf", mode='br') as file:
    fileContent = file.read()
for i in fileContent:
    print(i, end=" ")

This prints decimal integers, which I assumed were ASCII codes. However, ASCII values cover only 0..127, whereas this output includes integers greater than 127, such as 225, 108, 180, and 193.

Can someone tell me what encoding/mechanism is used?


Solution

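  • Binary mode applies no encoding at all: file.read() returns a bytes object, and iterating over it yields each raw byte as an integer from 0 to 255, which is why you see values above 127. To get text, you have to decode the bytes yourself with an explicit encoding.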

    Documentation:

    Files opened in binary mode (appending 'b' to the mode argument) return contents as bytes objects without any decoding

    Example:

    # file encoding: utf-16
    with open('data.txt', 'rb') as fp:
        buf = fp.read()
        print(buf)
    
    # Output
    b'\xff\xfeH\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00\n\x00'
    
    >>> buf.decode('utf-8')
    ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
    
    >>> buf.decode('utf-16')
    'Hello world\n'
    

    The text contains only ASCII characters, but the file is encoded as UTF-16, so I have to decode the raw bytes manually with that encoding.
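
To tie this back to the integers in the question, here is a minimal sketch of where values above 127 come from. It assumes a small UTF-8 sample file written to a temporary location; the helper name `byte_values` and the sample contents are made up for illustration and are not part of the original question.

```python
import os
import tempfile


def byte_values(path):
    """Return the file's contents as a list of integers, one per byte."""
    with open(path, "rb") as fp:
        # Iterating over (or listing) a bytes object yields ints in range(256).
        return list(fp.read())


# Write a small sample file so the sketch is self-contained.
# "é" is two bytes in UTF-8 (0xC3 0xA9); "A" is one byte (0x41).
with tempfile.NamedTemporaryFile(delete=False, suffix=".bin") as tmp:
    tmp.write("é".encode("utf-8") + b"A")
    sample_path = tmp.name

values = byte_values(sample_path)
os.remove(sample_path)

print(values)  # [195, 169, 65]

# Turning the integers back into text needs an explicit encoding.
decoded = bytes(values).decode("utf-8")
```

The numbers 195 and 169 are not ASCII codes. They are the two raw bytes that UTF-8 uses for a single character. In a PDF, most bytes above 127 belong to compressed or binary streams rather than text, so they will generally not decode to anything readable.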