CME datamine sample files in python

I have some sample market data from CME DataMine. The files look like the following

'1128=9\x019=156\x0135=X\x0149=CME\x0134=431875\x0152=20130715053428544\x0175=20130715\x01268=1\x01279=1\x0122=8\x0148=19590\x0183=1491\x01107=ZCN4-ZCU4\x01269=0\x01270=0.75\x01271=7\x01273=53428000\x01336=2\x01346=3\x011023=3\x0110=233\x01'\n
'1128=9\x019=104\x0135=f\x0149=CME\x0134=6613\x0152=20130714180206034\x0122=8\x0148=164576\x0175=20130715\x01107=ZC:BF H4-K4-N4\x01332=159\x01333=-161\x0110=138\x01'\n
'1128=9\x019=1634\x0135=X\x0149=CME\x0134=3177179\x0152=20130715181451830\x0175=20130715\x01268=16\x01279=1\x0122=8\x0148=1647\x0183=180151\x01107=ZC:BF U3-Z3-H4\x01269=1\x01270=45\x01271=195\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=21907\x0183=118087\x01107=ZCU3-ZCH4\x01269=1\x01270=21\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=21907\x0183=118088\x01107=ZCU3-ZCH4\x01269=1\x01270=21.25\x01271=10\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=504398\x0183=29594\x01107=ZCU3-ZCH5\x01269=1\x01270=11.25\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=504398\x0183=29595\x01107=ZCU3-ZCH5\x01269=1\x01270=11.5\x01271=2\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=21970\x0183=50008\x01107=ZCU3-ZCK4\x01269=1\x01270=13.5\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=21970\x0183=50009\x01107=ZCU3-ZCK4\x01269=1\x01270=13.75\x01271=8\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=558644\x0183=48554\x01107=ZCU3-ZCN4\x01269=1\x01270=7\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=558644\x0183=48555\x01107=ZCU3-ZCN4\x01269=1\x01270=7.25\x01271=8\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=22098\x0183=52504\x01107=ZCU3-ZCU4\x01269=1\x01270=15.25\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=22098\x0183=52505\x01107=ZCU3-ZCU4\x01269=1\x01270=15.5\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=504409\x0183=16282\x01107=ZCU3-ZCU5\x01269=1\x01270=25.25\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=588275\x0183=265779\x01107=ZCU3-ZCZ3\x01269=1\x01270=33\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=588275\x0183=265780\x01107=ZCU3-ZCZ3\x01269=1\x01270=33.25\x01271=20\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=519645\x0183=32899\x01107=ZCU3-ZCZ4\x01269=1\x01270=9.75\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=19282\x0183=20689\x01107=ZCU3-ZCZ5\x01269=1\x01270=29.75\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x0110=007\x01'\n

That is, in each row, fields are separated by the \x01 character, and each field contains entries in the form tag=value.

The tags and values correspond to those specified in the FIXML specs: https://www.onixs.biz/fix-dictionary/5.0.SP2/fields_by_tag.html

I would like to parse this file into a Pandas DataFrame where the column names are the tag names, the rows are the rows in the data file and the values are parsed values (so for example, Column 1 will be AppVerId, and row 1, column 1 will be: FIX50SP2).

How can I do this using pandas?

I have tried the following:

import pandas as pd
df  = pd.read_table('data/corn/MDFF_CBT_20130714-20130715_7813_0',delimiter='\x01',header=None)

but get the following error:

ParserError: Error tokenizing data. C error: Expected 22 fields in line 3, saw 202

I am guessing this has to do with the fact that not all rows have the same number of columns.

Solution

Main problem is in third column tag are duplicated. So need defaultdict for store:

from collections import defaultdict

rows = []
with open("CME.dat") as f:
    for line in f:
        row_dict = defaultdict(list)
        # split each line
        for element in line.split('\\x01'):
            try:
                tag, val = element.split('=')
                row_dict[tag].append(val)
            except:
                pass
        rows.append(row_dict)
#print (rows)

Then create DataFrame but for scalar from lists first need replace NaNs to [], remove ' from columns names. Last use list comprehension with concat, reshape by stack and if necessary fill NaNs per rows add groupby with ffill:

df = pd.DataFrame(rows).applymap(lambda x: x if isinstance(x, list) else [])

df.columns = df.columns.str.strip("'")
df = (pd.concat([pd.DataFrame(df[x].values.tolist()) for x in df], keys=df.columns, axis=1)
       .stack()
       .groupby(level=0)
       .ffill()
       .reset_index(drop=True))

print (df)

   1128   10  1023             107 22   268   269    270   271        273  \
0     9  233     3       ZCN4-ZCU4  8     1     0   0.75     7   53428000   
1     9  138  None  ZC:BF H4-K4-N4  8  None  None   None  None       None   
2     9  007     2  ZC:BF U3-Z3-H4  8    16     1     45   195  181451000   
3     9  007     1       ZCU3-ZCH4  8    16     1     21     3  181451000   
4     9  007     2       ZCU3-ZCH4  8    16     1  21.25    10  181451000   
5     9  007     1       ZCU3-ZCH5  8    16     1  11.25     3  181451000   
6     9  007     2       ZCU3-ZCH5  8    16     1   11.5     2  181451000   
7     9  007     1       ZCU3-ZCK4  8    16     1   13.5     3  181451000   
8     9  007     2       ZCU3-ZCK4  8    16     1  13.75     8  181451000   
9     9  007     1       ZCU3-ZCN4  8    16     1      7     3  181451000   
10    9  007     2       ZCU3-ZCN4  8    16     1   7.25     8  181451000   
11    9  007     1       ZCU3-ZCU4  8    16     1  15.25     3  181451000   
12    9  007     2       ZCU3-ZCU4  8    16     1   15.5     1  181451000   
13    9  007     2       ZCU3-ZCU5  8    16     1  25.25     1  181451000   
14    9  007     1       ZCU3-ZCZ3  8    16     1     33     3  181451000   
15    9  007     2       ZCU3-ZCZ3  8    16     1  33.25    20  181451000   
16    9  007     2       ZCU3-ZCZ4  8    16     1   9.75     1  181451000   
17    9  007     2       ZCU3-ZCZ5  8    16     1  29.75     1  181451000   

   336       34   346 35      48   49                 52        75  \
0   ...      2   431875     3  X   19590  CME  20130715053428544  20130715   
1   ...   None     6613  None  f  164576  CME  20130714180206034  20130715   
2   ...      2  3177179  None  X    1647  CME  20130715181451830  20130715   
3   ...      2  3177179  None  X   21907  CME  20130715181451830  20130715   
4   ...      2  3177179  None  X   21907  CME  20130715181451830  20130715   
5   ...      2  3177179  None  X  504398  CME  20130715181451830  20130715   
6   ...      2  3177179  None  X  504398  CME  20130715181451830  20130715   
7   ...      2  3177179  None  X   21970  CME  20130715181451830  20130715   
8   ...      2  3177179  None  X   21970  CME  20130715181451830  20130715   
9   ...      2  3177179  None  X  558644  CME  20130715181451830  20130715   
10  ...      2  3177179  None  X  558644  CME  20130715181451830  20130715   
11  ...      2  3177179  None  X   22098  CME  20130715181451830  20130715   
12  ...      2  3177179  None  X   22098  CME  20130715181451830  20130715   
13  ...      2  3177179  None  X  504409  CME  20130715181451830  20130715   
14  ...      2  3177179  None  X  588275  CME  20130715181451830  20130715   
15  ...      2  3177179  None  X  588275  CME  20130715181451830  20130715   
16  ...      2  3177179  None  X  519645  CME  20130715181451830  20130715   
17  ...      2  3177179  None  X   19282  CME  20130715181451830  20130715   

        83     9  
0     1491   156  
1     None   104  
2   180151  1634  
3   118087  1634  
4   118088  1634  
5    29594  1634  
6    29595  1634  
7    50008  1634  
8    50009  1634  
9    48554  1634  
10   48555  1634  
11   52504  1634  
12   52505  1634  
13   16282  1634  
14  265779  1634  
15  265780  1634  
16   32899  1634  
17   20689  1634  

[18 rows x 24 columns]