I have not used the csv module in Python before, but it seems like a great tool for my needs.
The problem I am facing is that my CSV file looks a bit strange (it was hard to notice at first).
See the example below for what I mean:
A|B|C|D|x|F|G|x|H
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6
Some of the columns above (the ones headed x) should be read as Dx and Gx. The D and G columns that precede them are always empty.
So when csv processes the header, it should create the headers Dx and Gx, ignore the D and G columns, and take the values from the x columns instead.
(I first tried to read the file as-is with the following code and post-process it, but that is not possible: the column name x is always the same, and csv.DictReader seems to discard repeated column names.)
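import csv

csv.register_dialect('pipes', delimiter='|')  # assumed definition of the 'pipes' dialect

with open('myCSV.log') as csvFile:
    reader = csv.DictReader(csvFile, dialect='pipes')
    for row in reader:
        print(row)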
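To illustrate why post-processing the DictReader output does not help, here is a minimal sketch that feeds the sample data through csv.DictReader from an in-memory string. The io.StringIO object stands in for the file, and the inline delimiter='|' is an assumption about what the 'pipes' dialect does. Each row becomes a dict keyed by header name, so the second x overwrites the first.

```python
import csv
import io

# Sample data from above, held in memory instead of myCSV.log.
sample = "A|B|C|D|x|F|G|x|H\n1||2||3|4||5|6\n"

reader = csv.DictReader(io.StringIO(sample), delimiter='|')
first_row = next(reader)

# Both x columns map to the same key, so only the last value survives.
print(first_row['x'])
print(len(first_row))
```

The header has nine fields but the row dict has only eight keys, and the value that belonged to the first x column (3) is gone.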
Can this be achieved with a dialect, or some other fancy way of configuring the csv module, or do I have to parse this myself?
Use pandas:
import pandas as pd
print(pd.read_csv('test.csv', sep='|'))
out:
   A    B  C    D  x  F    G  x.1  H
0  1  NaN  2  NaN  3  4  NaN    5  6
1  1  NaN  2  NaN  3  4  NaN    5  6
2  1  NaN  2  NaN  3  4  NaN    5  6
3  1  NaN  2  NaN  3  4  NaN    5  6
If you need to, you can convert it to a dict using the to_dict() method:
{'A': {0: 1, 1: 1, 2: 1, 3: 1},
'B': {0: nan, 1: nan, 2: nan, 3: nan},
'C': {0: 2, 1: 2, 2: 2, 3: 2},
'D': {0: nan, 1: nan, 2: nan, 3: nan},
'F': {0: 4, 1: 4, 2: 4, 3: 4},
'G': {0: nan, 1: nan, 2: nan, 3: nan},
'H': {0: 6, 1: 6, 2: 6, 3: 6},
'x': {0: 3, 1: 3, 2: 3, 3: 3},
'x.1': {0: 5, 1: 5, 2: 5, 3: 5}}
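If you would rather have one dict per row, closer to what csv.DictReader yields, to_dict() also accepts orient='records'. The sketch below is only an illustration: it reads the sample data from an in-memory string (io.StringIO) instead of test.csv so that it is self-contained.

```python
import io
import math

import pandas as pd

# Two sample rows, held in memory instead of test.csv.
sample = "A|B|C|D|x|F|G|x|H\n1||2||3|4||5|6\n1||2||3|4||5|6\n"
df = pd.read_csv(io.StringIO(sample), sep='|')

# One dict per row, keyed by column name; the duplicate header shows up as 'x.1'.
records = df.to_dict(orient='records')
print(records[0])

# Empty cells come back as float NaN, so test them with math.isnan.
print(math.isnan(records[0]['B']))
```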
EDIT: If you need certain names for columns you can do this:
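import pandas as pd
df = pd.read_csv('test.csv', sep='|')
# Rename each duplicated 'x' header ('x', 'x.1', ...) to '<previous column>x'.
df.columns = [df.columns[index - 1] + 'x' if name.split('.')[0] == 'x'
              else name for index, name in enumerate(df.columns)]
print(df)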
   A    B  C    D  Dx  F    G  Gx  H
0  1  NaN  2  NaN   3  4  NaN   5  6
1  1  NaN  2  NaN   3  4  NaN   5  6
2  1  NaN  2  NaN   3  4  NaN   5  6
3  1  NaN  2  NaN   3  4  NaN   5  6
If you want to drop the empty columns:
print(df.dropna(axis=1, how='all'))
   A  C  Dx  F  Gx  H
0  1  2   3  4   5  6
1  1  2   3  4   5  6
2  1  2   3  4   5  6
3  1  2   3  4   5  6
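If you would rather stay with the csv module, the same renaming can probably be done by reading the header yourself with csv.reader and building the row dicts by hand. This is only a sketch: read_merged is a hypothetical helper name, and it assumes every x header directly follows the empty column it replaces and that two x headers never appear back to back.

```python
import csv
import io


def read_merged(lines, delimiter='|'):
    """Yield one dict per row, renaming each 'x' header to '<previous>x'
    and dropping the empty column that preceded it."""
    reader = csv.reader(lines, delimiter=delimiter)
    header = next(reader)

    # Build (column index, final name) pairs for the columns to keep.
    keep = []
    for i, name in enumerate(header):
        if name == 'x' and i > 0:
            keep.pop()  # drop the empty column just before this x
            keep.append((i, header[i - 1] + 'x'))
        else:
            keep.append((i, name))

    for row in reader:
        yield {name: row[i] for i, name in keep}


# Sample data held in memory; an open file object works the same way.
sample = "A|B|C|D|x|F|G|x|H\n1||2||3|4||5|6\n1||2||3|4||5|6\n"
rows = list(read_merged(io.StringIO(sample)))
print(rows[0])
```

Unlike the pandas version, the values stay strings and the empty B column is kept (as an empty string), since only the columns in front of an x are dropped.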