python pandas csv lzma bi5

Decompress and read Dukascopy .bi5 tick files


To cut a long story short, I need to open a .bi5 file and read its contents. The problem: I have tens of thousands of .bi5 files containing time-series data that I need to decompress and process (read, then dump into pandas).

I ended up installing Python 3 (I normally use 2.7) specifically for the lzma library, because I ran into compilation nightmares with the lzma backports for Python 2.7. Even with Python 3 I have had no success. The problems are too numerous to list, and no one reads long questions!

I have included one of the .bi5 files. If someone could get it into a pandas DataFrame and show me how they did it, that would be ideal.

P.S. The file is only a few kB, so it will download in a second. Thanks very much in advance.

(The file) http://www.filedropper.com/13hticks


Solution

  • The code below should do the trick. It first opens the file and decompresses it with lzma, then uses struct to unpack the binary data.

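    import lzma
    import struct
    import pandas as pd
    
    
    def bi5_to_df(filename, fmt):
        # Size in bytes of one fixed-length record described by fmt.
        chunk_size = struct.calcsize(fmt)
        data = []
        # lzma.open decompresses transparently while reading.
        with lzma.open(filename) as f:
            while True:
                chunk = f.read(chunk_size)
                if chunk:
                    # One record becomes one tuple, i.e. one DataFrame row.
                    data.append(struct.unpack(fmt, chunk))
                else:
                    # An empty read means end of file.
                    break
        df = pd.DataFrame(data)
        return df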
    

    The most important thing is to know the right format. I googled around, experimented, and found that '>3i2f' (or '>3I2f') works quite well: big-endian, three 4-byte ints followed by two 4-byte floats, i.e. 20 bytes per record. The 'i4f' you suggested doesn't produce sensible floats, whether big- or little-endian. For struct and format syntax, see the docs.
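
    To make the format string concrete, here is a small sketch that packs one synthetic record and unpacks it again. The field values are made up for illustration (the integers are borrowed from the sample output), and nothing here reads a real .bi5 file.

```python
import struct

FMT = ">3I2f"  # big-endian: three unsigned 32-bit ints, then two 32-bit floats

# With an explicit byte order there is no padding: 3*4 + 2*4 = 20 bytes.
record_size = struct.calcsize(FMT)

# Build one synthetic record and unpack it with the same format.
raw = struct.pack(FMT, 210, 110218, 110216, 1.5, 1.0)
fields = struct.unpack(FMT, raw)

# Reading the same bytes as little-endian yields nonsense values,
# which is a quick way to spot a wrong byte order.
wrong = struct.unpack("<3I2f", raw)

print(record_size, fields, wrong[0])
```

    A wrong format usually shows up immediately as absurd integers or floats, so trying a candidate format on the first record is a cheap sanity check.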

    df = bi5_to_df('13h_ticks.bi5', '>3i2f')
    df.head()
    Out[177]: 
          0       1       2     3     4
    0   210  110218  110216  1.87  1.12
    1   362  110219  110216  1.00  5.85
    2   875  110220  110217  1.00  1.12
    3  1408  110220  110218  1.50  1.00
    4  1884  110221  110219  3.94  1.00
    

    Update

    To compare the output of bi5_to_df with https://github.com/ninety47/dukascopy, I compiled and ran test_read_bi5 from there. The first lines of the output are:

    time, bid, bid_vol, ask, ask_vol
    2012-Dec-03 01:00:03.581000, 131.945, 1.5, 131.966, 1.5
    2012-Dec-03 01:00:05.142000, 131.943, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.202000, 131.943, 1.5, 131.964, 2.25
    2012-Dec-03 01:00:05.321000, 131.944, 1.5, 131.964, 1.5
    2012-Dec-03 01:00:05.441000, 131.944, 1.5, 131.964, 1.5
    

    And bi5_to_df on the same input file gives:

    bi5_to_df('01h_ticks.bi5', '>3I2f').head()
    Out[295]: 
          0       1       2     3    4
    0  3581  131966  131945  1.50  1.5
    1  5142  131964  131943  1.50  1.5
    2  5202  131964  131943  2.25  1.5
    3  5321  131964  131944  1.50  1.5
    4  5441  131964  131944  1.50  1.5
    

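    So everything seems to be fine: the values match. ninety47's code merely reorders the columns, turns the millisecond offset into a timestamp, and divides the integer prices by 1000.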
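
    Judging from that comparison, column 0 looks like milliseconds since the start of the hour, columns 1 and 2 are ask and bid as scaled integers, and columns 3 and 4 are ask and bid volume. A hedged sketch of turning the raw frame into named columns follows. The helper name label_ticks is hypothetical, the hour start has to be supplied by the caller, and the price scale of 1000 is an assumption that fits this file only and may differ for other instruments.

```python
import pandas as pd


def label_ticks(df, hour_start, price_scale=1000.0):
    """Name the raw columns produced by bi5_to_df (hypothetical helper).

    hour_start: start of the hour the file covers (assumed known by the caller).
    price_scale: divisor for integer prices (assumption; 1000 fits the sample).
    """
    return pd.DataFrame({
        # Column 0: millisecond offset from the start of the hour.
        "time": pd.Timestamp(hour_start) + pd.to_timedelta(df[0], unit="ms"),
        "bid": df[2] / price_scale,
        "bid_vol": df[4],
        "ask": df[1] / price_scale,
        "ask_vol": df[3],
    })


# The first two rows of the raw output shown above.
raw = pd.DataFrame([
    (3581, 131966, 131945, 1.50, 1.5),
    (5142, 131964, 131943, 1.50, 1.5),
])
ticks = label_ticks(raw, "2012-12-03 01:00:00")
print(ticks)
```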

    Also, it's probably more accurate to use '>3I2f' instead of '>3i2f' (i.e. unsigned int instead of int).
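
    With tens of thousands of files, reading one record at a time may become a bottleneck. One possible variation, sketched below, decompresses each file in a single call and lets struct.iter_unpack walk the buffer. The function name read_bi5_records is hypothetical, and the default format is the one arrived at above.

```python
import lzma
import struct


def read_bi5_records(path, fmt=">3I2f"):
    """Return all records of one file as a list of tuples (hypothetical helper)."""
    with open(path, "rb") as f:
        # Decompress the whole file at once; the container format is auto-detected.
        payload = lzma.decompress(f.read())
    # iter_unpack raises struct.error if the payload length is not a
    # multiple of the record size, which flags truncated or misread files.
    return list(struct.iter_unpack(fmt, payload))
```

    The resulting list can be passed to pd.DataFrame exactly like the data list in bi5_to_df.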