python pandas byte

Convert undelimited bytes to pandas DataFrame


I am sorry if this is a duplicate, but I didn't find a suitable answer for this problem.

I have a bytes object in Python, like this:

b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

It contains first a certain number of integers (4 bytes each), then a string of 4 characters, and then a certain number of floats (4 bytes each).
This pattern is repeated a certain number of times, and each repetition corresponds to a new row of data. The format of each row is the same and known. In the example, there are 2 rows, each with 2 integers, 1 string, and 2 floats.

My question is whether there is a way to convert this kind of data to a pandas DataFrame directly.

My current approach is to first read all values (e.g. with struct.Struct.unpack) and place them in a list of lists. However, this seems rather slow, especially for a large number of rows.
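
For reference, the struct-based approach described above might look roughly like the sketch below. The little-endian byte order, the format string `<2i4s2f`, and the column names are assumptions made for illustration; they are not stated in the question.

```python
import struct

import pandas as pd

data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

# Assumed row layout: little-endian, 2 x int32, one 4-byte string, 2 x float32.
row = struct.Struct('<2i4s2f')

# iter_unpack walks the buffer one row at a time; it requires len(data)
# to be an exact multiple of row.size (20 bytes for this layout).
rows = [list(values) for values in row.iter_unpack(data)]

# Column names are hypothetical, chosen only for this example.
columns = ['int1', 'int2', 'string', 'float1', 'float2']
df_struct = pd.DataFrame(rows, columns=columns)

# The 4-byte field is unpacked as bytes; decode it to str.
df_struct['string'] = df_struct['string'].str.decode('ascii')
```

Every value passes through a Python-level tuple and list here, which is the likely reason this gets slow as the number of rows grows.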


Solution

  • This works fine for me:

    import numpy as np
    import pandas as pd
    
    data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'
    
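    # Describe one row of the binary layout: 2 int32, a 4-byte string, 2 float32 (20 bytes).
    # np.int32 and np.float32 use the machine's native byte order.
    dtype = np.dtype([
        ('int1', np.int32),
        ('int2', np.int32),
        ('string', 'S4'),
        ('float1', np.float32),
        ('float2', np.float32),
    ])
    
    # Interpret the whole buffer as an array of rows, without a Python-level loop.
    structured_array = np.frombuffer(data, dtype=dtype)
    
    df = pd.DataFrame(structured_array)
    
    # The 'S4' field arrives as bytes objects; decode it to str.
    df['string'] = df['string'].str.decode('utf-8')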
    
    print(df)
    

    And it gives me the following output:

       int1  int2 string    float1    float2
    0    10     1   TEST  8.530916  7.095888
    1     5     3   TEST  1.637560  2.697883
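
Since the question mentions "a certain number" of integers and floats, the same idea can be generalized by building the dtype from the counts. The following is only a sketch under stated assumptions: the helper names `make_row_dtype` and `bytes_to_frame` are hypothetical, the field names are generated for illustration, and the data is assumed to be little-endian with a single ASCII string field between the integers and the floats.

```python
import numpy as np
import pandas as pd


def make_row_dtype(n_ints, str_len, n_floats):
    """Build a structured dtype: n_ints int32, one fixed-width string, n_floats float32.

    '<i4' and '<f4' spell out the assumed little-endian byte order explicitly.
    """
    fields = [(f'int{i + 1}', '<i4') for i in range(n_ints)]
    fields.append(('string', f'S{str_len}'))
    fields += [(f'float{i + 1}', '<f4') for i in range(n_floats)]
    return np.dtype(fields)


def bytes_to_frame(data, row_dtype):
    """Convert undelimited bytes to a DataFrame with one row per record."""
    # Fail early with a clear message if the buffer holds a partial row.
    if len(data) % row_dtype.itemsize:
        raise ValueError(
            f'buffer length {len(data)} is not a multiple of '
            f'the row size {row_dtype.itemsize}'
        )
    records = np.frombuffer(data, dtype=row_dtype)
    df = pd.DataFrame(records)
    # Decode every fixed-width bytes field (kind 'S') to str.
    for name in row_dtype.names:
        if row_dtype[name].kind == 'S':
            df[name] = df[name].str.decode('ascii')
    return df


data = b'\n\x00\x00\x00\x01\x00\x00\x00TEST\xa2~\x08A\x83\x11\xe3@\x05\x00\x00\x00\x03\x00\x00\x00TEST\x91\x9b\xd1?\x1c\xaa,@'

row_dtype = make_row_dtype(n_ints=2, str_len=4, n_floats=2)
df = bytes_to_frame(data, row_dtype)
```

The explicit length check is optional, because `np.frombuffer` also raises a `ValueError` when the buffer size is not a multiple of the element size; it is included here only to give a more descriptive message.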