I have some sample market data from CME DataMine. The files look like the following
'1128=9\x019=156\x0135=X\x0149=CME\x0134=431875\x0152=20130715053428544\x0175=20130715\x01268=1\x01279=1\x0122=8\x0148=19590\x0183=1491\x01107=ZCN4-ZCU4\x01269=0\x01270=0.75\x01271=7\x01273=53428000\x01336=2\x01346=3\x011023=3\x0110=233\x01'\n
'1128=9\x019=104\x0135=f\x0149=CME\x0134=6613\x0152=20130714180206034\x0122=8\x0148=164576\x0175=20130715\x01107=ZC:BF H4-K4-N4\x01332=159\x01333=-161\x0110=138\x01'\n
'1128=9\x019=1634\x0135=X\x0149=CME\x0134=3177179\x0152=20130715181451830\x0175=20130715\x01268=16\x01279=1\x0122=8\x0148=1647\x0183=180151\x01107=ZC:BF U3-Z3-H4\x01269=1\x01270=45\x01271=195\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=21907\x0183=118087\x01107=ZCU3-ZCH4\x01269=1\x01270=21\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=21907\x0183=118088\x01107=ZCU3-ZCH4\x01269=1\x01270=21.25\x01271=10\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=504398\x0183=29594\x01107=ZCU3-ZCH5\x01269=1\x01270=11.25\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=504398\x0183=29595\x01107=ZCU3-ZCH5\x01269=1\x01270=11.5\x01271=2\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=21970\x0183=50008\x01107=ZCU3-ZCK4\x01269=1\x01270=13.5\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=21970\x0183=50009\x01107=ZCU3-ZCK4\x01269=1\x01270=13.75\x01271=8\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=558644\x0183=48554\x01107=ZCU3-ZCN4\x01269=1\x01270=7\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=558644\x0183=48555\x01107=ZCU3-ZCN4\x01269=1\x01270=7.25\x01271=8\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=22098\x0183=52504\x01107=ZCU3-ZCU4\x01269=1\x01270=15.25\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=22098\x0183=52505\x01107=ZCU3-ZCU4\x01269=1\x01270=15.5\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=504409\x0183=16282\x01107=ZCU3-ZCU5\x01269=1\x01270=25.25\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=1\x0122=8\x0148=588275\x0183=265779\x01107=ZCU3-ZCZ3\x01269=1\x01270=33\x01271=3\x01273=181451000\x01276=K\x01336=2\x011023=1\x01279=1\x0122=8\x0148=588275\x0183=265780\x01107=ZCU3-ZCZ3\x01269=1\x01270=33.25\x01271=20\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=519645\x0183=32899\x01107=ZCU3-ZCZ4\x01269=1\x01270=9.75\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x01279=0\x0122=8\x0148=19282\x0183=20689\x01107=ZCU3-ZCZ5\x01269=1\x01270=29.75\x01271=1\x01273=181451000\x01276=K\x01336=2\x011023=2\x0110=007\x01'\n
That is, in each row, fields are separated by the \x01 character, and each field contains entries in the form tag=value.
The tags and values correspond to those specified in the FIXML specs: https://www.onixs.biz/fix-dictionary/5.0.SP2/fields_by_tag.html
I would like to parse this file into a Pandas DataFrame where the column names are the tag names, the rows are the rows in the data file and the values are parsed values (so for example, Column 1 will be AppVerId, and row 1, column 1 will be: FIX50SP2).
How can I do this using pandas?
I have tried the following:
import pandas as pd
df = pd.read_table('data/corn/MDFF_CBT_20130714-20130715_7813_0',delimiter='\x01',header=None)
but get the following error:
ParserError: Error tokenizing data. C error: Expected 22 fields in line 3, saw 202
I am guessing this has to do with the fact that not all rows have the same number of columns.
Main problem is in third column tag are duplicated. So need defaultdict
for store:
from collections import defaultdict
rows = []
with open("CME.dat") as f:
for line in f:
row_dict = defaultdict(list)
# split each line
for element in line.split('\\x01'):
try:
tag, val = element.split('=')
row_dict[tag].append(val)
except:
pass
rows.append(row_dict)
#print (rows)
Then create DataFrame
but for scalar from lists first need replace NaN
s to []
, remove '
from columns names. Last use list comprehension with concat
, reshape by stack
and if necessary fill NaN
s per rows add groupby
with ffill
:
df = pd.DataFrame(rows).applymap(lambda x: x if isinstance(x, list) else [])
df.columns = df.columns.str.strip("'")
df = (pd.concat([pd.DataFrame(df[x].values.tolist()) for x in df], keys=df.columns, axis=1)
.stack()
.groupby(level=0)
.ffill()
.reset_index(drop=True))
print (df)
1128 10 1023 107 22 268 269 270 271 273 \
0 9 233 3 ZCN4-ZCU4 8 1 0 0.75 7 53428000
1 9 138 None ZC:BF H4-K4-N4 8 None None None None None
2 9 007 2 ZC:BF U3-Z3-H4 8 16 1 45 195 181451000
3 9 007 1 ZCU3-ZCH4 8 16 1 21 3 181451000
4 9 007 2 ZCU3-ZCH4 8 16 1 21.25 10 181451000
5 9 007 1 ZCU3-ZCH5 8 16 1 11.25 3 181451000
6 9 007 2 ZCU3-ZCH5 8 16 1 11.5 2 181451000
7 9 007 1 ZCU3-ZCK4 8 16 1 13.5 3 181451000
8 9 007 2 ZCU3-ZCK4 8 16 1 13.75 8 181451000
9 9 007 1 ZCU3-ZCN4 8 16 1 7 3 181451000
10 9 007 2 ZCU3-ZCN4 8 16 1 7.25 8 181451000
11 9 007 1 ZCU3-ZCU4 8 16 1 15.25 3 181451000
12 9 007 2 ZCU3-ZCU4 8 16 1 15.5 1 181451000
13 9 007 2 ZCU3-ZCU5 8 16 1 25.25 1 181451000
14 9 007 1 ZCU3-ZCZ3 8 16 1 33 3 181451000
15 9 007 2 ZCU3-ZCZ3 8 16 1 33.25 20 181451000
16 9 007 2 ZCU3-ZCZ4 8 16 1 9.75 1 181451000
17 9 007 2 ZCU3-ZCZ5 8 16 1 29.75 1 181451000
336 34 346 35 48 49 52 75 \
0 ... 2 431875 3 X 19590 CME 20130715053428544 20130715
1 ... None 6613 None f 164576 CME 20130714180206034 20130715
2 ... 2 3177179 None X 1647 CME 20130715181451830 20130715
3 ... 2 3177179 None X 21907 CME 20130715181451830 20130715
4 ... 2 3177179 None X 21907 CME 20130715181451830 20130715
5 ... 2 3177179 None X 504398 CME 20130715181451830 20130715
6 ... 2 3177179 None X 504398 CME 20130715181451830 20130715
7 ... 2 3177179 None X 21970 CME 20130715181451830 20130715
8 ... 2 3177179 None X 21970 CME 20130715181451830 20130715
9 ... 2 3177179 None X 558644 CME 20130715181451830 20130715
10 ... 2 3177179 None X 558644 CME 20130715181451830 20130715
11 ... 2 3177179 None X 22098 CME 20130715181451830 20130715
12 ... 2 3177179 None X 22098 CME 20130715181451830 20130715
13 ... 2 3177179 None X 504409 CME 20130715181451830 20130715
14 ... 2 3177179 None X 588275 CME 20130715181451830 20130715
15 ... 2 3177179 None X 588275 CME 20130715181451830 20130715
16 ... 2 3177179 None X 519645 CME 20130715181451830 20130715
17 ... 2 3177179 None X 19282 CME 20130715181451830 20130715
83 9
0 1491 156
1 None 104
2 180151 1634
3 118087 1634
4 118088 1634
5 29594 1634
6 29595 1634
7 50008 1634
8 50009 1634
9 48554 1634
10 48555 1634
11 52504 1634
12 52505 1634
13 16282 1634
14 265779 1634
15 265780 1634
16 32899 1634
17 20689 1634
[18 rows x 24 columns]