python hex pickle

How to understand print result of byte data read from a pickle file?


I am trying to get data from a pickle file. As far as I know, serialization converts the data into a byte stream. When I read the file in binary mode with this code:

# Open in binary mode; the with block closes the file afterwards.
with open("alexnet.pth", "rb") as f:
    data = f.read()
print(data)

I got this result:

b'PK\x03\x04\x00\x00\x08\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x10\x00\x12\x00archive/data.pklFB\x0e\x00ZZZZZZZZZZZZZZ\x80\x02ccollections\nOrderedDict\nq\x00)Rq\x01(X\x11\x00\x00\x00features.0.weightq\x02ctorch._utils\n_rebuild_tensor_v2\nq\x03((X\x07\x00\x00\x00storageq\x04ctorch\nFloatStorage\nq\x05X\r\x00\x00\x002472041505024q\x06X\x03\x00\x00\x00cpuq\x07M\xc0Ztq\x08QK\x00(K@K\x03K\x0bK\x0btq\t(Mk\x01KyK\x0bK\x01tq\n\x89h\x00)Rq\x0btq\x0cRq\rX\x0f\x00\x00\x00features.0.biasq\x0eh\x03((h\x04h\x05X\r\x00\x00\x002472041504928q\x0fh\x07K@tq\x10QK\x00K@\x85q\x11K\x01\x85q\x12\x89h\x00)Rq\x13tq\x14Rq\x15X\x11\x00\x00\x00features.3.weightq\x16h\x03((h\x04h\x05X\r\x00\x00\x002472041505120q\x17h\x07J\x00\xb0\x04\x00tq\x18QK\x00(K\xc0K@K\x05K\x05tq\x19(M@\x06K\x19K\x05K\x01tq\x1a\x89h\x00)Rq\x1btq\x1cRq\x1dX\x0f\x00\x00\x00features.3.biasq\x1eh\x03((h\x04h\x05X\r\x00\x00\x002472041507136q\x1fh\x07K\xc0tqQK\x00K\xc0\x85q!K\x01\x85q"\x89h\x00)Rq#tq$Rq%X\x11\x00\x00\x00features.6.weightq&h\x03((h\x04h\x05X\r\x00\x00\x002472041509056q'h\x07J\x00 \n\x00tq(QK\x00(M\x80\x01K\xc0K\x03K\x03tq)(M\xc0\x06K\tK\x03K\x01tq*\x89h\x00)Rq+tq,Rq-X\x0f\x00\x00\x00features.6.biasq.h\x03((h\x04h\x05X\r\x00\x00\x002472041505312q/h\x07M\x80\x01tq0QK\x00M\x80\x01\x85q1K\x01\x85q2\x89h\x00)Rq3tq4Rq5X\x11\x00\x00\x00features.8.weightq6h\x03((h\x04h\x05X\r\x00\x00\x002472041508192q7h\x07J\x00\x80\r\x00tq8QK\x00(M\x00\x01M\x80\x01K\x03K\x03tq9(M\x80\rK\tK\x03K\x01tq:\x89h\x00)Rq;tq<Rq=X\x0f\x00\x00\x00features.8.biasq>h\x03((h\x04h\x05X\r\

I know the \x.. sequences are hexadecimal. My question is: does 1 byte contain 1 hexadecimal character (does every "\" mean 1 byte)? If not, how should I read this in terms of bytes? I also notice some English words, such as "\x02ctorch._utils" and "\n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?


Solution

  • does 1 byte contain 1 hexadecimal character (every "\" means 1 byte)?

    Technically, 1 byte holds a number between 0 and 255, which is usually written as two hexadecimal digits from 00 to FF, expressed in Python as \x00 to \xFF. So yes, in a sense every "\" marks one byte, but every 'normal' letter is a byte too. Python prints the ASCII character if the byte corresponds to a printable ASCII character (numbers 32-126) and an escape sequence otherwise: the '\x__' form for most ASCII control characters and for values >= 128, or a short escape such as \n, \r or \t where one exists. But a byte being printed as a character doesn't mean it was meant to be a character in the original data! (Although the readable function names surely were.)
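To make the "one byte, several spellings" point concrete, here is a minimal sketch that takes the first eight bytes of the dump in the question and shows them as integers, as hexadecimal digits, and as Python's printed form. The variable names are only illustrative.

```python
# The first eight bytes of the dump in the question.
data = b"PK\x03\x04\x00\x00\x08\x08"

# Each element of a bytes object is one byte, i.e. an integer from 0 to 255.
as_ints = list(data)
# The same bytes written as two hexadecimal digits each.
as_hex = " ".join(f"{b:02x}" for b in data)

print(len(data))   # number of bytes, not number of printed characters
print(as_ints)
print(as_hex)

# Printable ASCII bytes are displayed as characters ...
printable = bytes([0x50, 0x4B])
# ... while other bytes are displayed as escapes (\r is the short form of 0x0d).
escaped = bytes([0x0D, 0x80])
print(printable, escaped)
```

So `P` and `\x03` each stand for exactly one byte; only the way Python displays them differs.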

    How to read this in terms of byte?

    If you know what the bytes are supposed to represent (int16, int32, float, char, ASCII, UTF-8, ...), you can convert them with Python's struct module. Otherwise this representation is as good as any other.
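As a hedged illustration of the struct approach, the sketch below decodes one fragment copied from the dump. It assumes that the four bytes after the letter `X` are a little-endian unsigned 32-bit length, which is how the pickle format stores short unicode strings. Without that knowledge the same four bytes could just as well be read as two 16-bit integers.

```python
import struct

# Fragment copied from the dump: the letter X, four bytes, then readable text.
fragment = b"X\x11\x00\x00\x00features.0.weight"

# Assumption: bytes 1..4 are a little-endian unsigned 32-bit length.
(length,) = struct.unpack("<I", fragment[1:5])
text = fragment[5:5 + length].decode("utf-8")
print(length, text)

# The very same four bytes under a different assumption: two unsigned 16-bit ints.
as_two_shorts = struct.unpack("<HH", fragment[1:5])
print(as_two_shorts)
```

Here the length comes out as 17, which matches the 17 characters of `features.0.weight`, so the assumed layout is at least consistent with this fragment.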

    Also I notice there are some English words such as "\x02ctorch._utils" and "\n_rebuild_tensor_v2". What do they mean (hexadecimal + string)?

    As mentioned, these are just strings encoded in the data as ASCII (or UTF-8, which makes no difference in this case). The non-printable byte in front probably belongs to the data that comes before; there is no way to know for sure without knowing this particular format.
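If the surrounding bytes are assumed to be an ordinary pickle stream, the standard library's pickletools module can split them into opcodes without importing or executing anything. The sketch below feeds it the fragment around "ctorch._utils" from the dump, with a STOP opcode (".") appended by hand so that the parser sees a complete stream. The fragment boundaries are chosen here for illustration only.

```python
import pickletools

# Bytes around "ctorch._utils" copied from the dump, plus a trailing "."
# (the STOP opcode) added here so the fragment parses as a whole stream.
fragment = b"q\x02ctorch._utils\n_rebuild_tensor_v2\nq\x03."

# genops only parses the byte stream; it does not import torch or run any code.
ops = [(op.name, arg) for op, arg, pos in pickletools.genops(fragment)]
for name, arg in ops:
    print(name, repr(arg))
```

Read this way, `\x02` is the argument of the preceding opcode `q`, the letter `c` is itself an opcode that introduces a module name and an attribute name, and each `\n` simply terminates one of those two names.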

    As others have mentioned, there is not much to gain here by poking around this data. The code that writes these files is here. There is a lot of pickling and zipping going on, which mangles the original data even further.

    But it's always good to poke around!
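In that spirit, the "PK\x03\x04" at the very start of the dump is the signature of a zip archive, and "archive/data.pkl" looks like a member name inside it. The following sketch does not touch the real alexnet.pth; it builds a tiny stand-in archive in memory (a plain dict instead of model weights, purely an assumption for illustration) to show how the zip layer wraps the pickle layer.

```python
import io
import pickle
import zipfile

# Build a small stand-in archive in memory. The member name mirrors the one
# visible in the dump; the payload is a made-up dict, not real weights.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    payload_in = pickle.dumps({"features.0.bias": [0.0, 1.0]}, protocol=2)
    archive.writestr("archive/data.pkl", payload_in)
raw = buffer.getvalue()

signature = raw[:4]
print(signature)   # the zip local file header signature

# Reading goes the other way: open the zip, then look at the pickle inside it.
with zipfile.ZipFile(io.BytesIO(raw)) as archive:
    names = archive.namelist()
    payload_out = archive.read("archive/data.pkl")

print(names)
print(payload_out[:2])   # protocol marker at the start of the pickle stream
restored = pickle.loads(payload_out)  # only unpickle data you trust
print(restored)
```

For a real checkpoint, `zipfile.ZipFile("alexnet.pth").namelist()` would list the members in the same way, while actually unpickling data.pkl would additionally require torch to be installed.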