Search code examples
pythonparsingcsvdialect

Reading a CSV-file with uncomplete headers


I have not used the csv module in python before, but it seems like a great tool to use for my needs.

The problem I am facing is that my csv-file looks a bit strange. (it was hard to notice at first)

See the example below for what I mean:

A|B|C|D|x|F|G|x|H
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6
1||2||3|4||5|6

There are some indexes (such as column-x) above which should be read as Dx and Gx. The D-column and G-column are then empty columns.

So what I need is that when csv is processing the header it should create the headers like Dx and Gx and neglect the D and G columns and take the value from the x-column instead.

(I tried to read the file as is with the following code first and post process it, but it is not possible since the column name x is always the same, and csv.DictReader seems to discard repetitive column-names)

with open('myCSV.log') as csvFile:
    reader = csv.DictReader(csvFile, dialect='pipes')
    for row in reader:
        print row

Can this be achieved with a dialect, or some other fancy way of configuring the csv-module or do I have to parse this myself?


Solution

  • Use pandas:

    import pandas as pd
    print pd.read_csv('test.csv',sep='|')
    

    out:

       A   B  C   D  x  F   G  x.1  H
    0  1 NaN  2 NaN  3  4 NaN    5  6
    1  1 NaN  2 NaN  3  4 NaN    5  6
    2  1 NaN  2 NaN  3  4 NaN    5  6
    3  1 NaN  2 NaN  3  4 NaN    5  6
    

    and if you need you can convert it to a dict using to_dict() method:

    {'A': {0: 1, 1: 1, 2: 1, 3: 1},
     'B': {0: nan, 1: nan, 2: nan, 3: nan},
     'C': {0: 2, 1: 2, 2: 2, 3: 2},
     'D': {0: nan, 1: nan, 2: nan, 3: nan},
     'F': {0: 4, 1: 4, 2: 4, 3: 4},
     'G': {0: nan, 1: nan, 2: nan, 3: nan},
     'H': {0: 6, 1: 6, 2: 6, 3: 6},
     'x': {0: 3, 1: 3, 2: 3, 3: 3},
     'x.1': {0: 5, 1: 5, 2: 5, 3: 5}}
    

    EDIT: If you need certain names for columns you can do this:

    import pandas as pd
    df = pd.read_csv('test.csv',sep='|')
    df.columns = [df.columns[index-1]+'x' if 'x' in name 
                  else name for index,name in enumerate(df.columns)]
    print df
    
       A   B  C   D  Dx  F   G  Gx  H
    0  1 NaN  2 NaN   3  4 NaN   5  6
    1  1 NaN  2 NaN   3  4 NaN   5  6
    2  1 NaN  2 NaN   3  4 NaN   5  6
    3  1 NaN  2 NaN   3  4 NaN   5  6
    

    If you want to lose the empty cols:

    print df.dropna(axis=1,how='all')
    
       A  C  Dx  F  Gx  H
    0  1  2   3  4   5  6
    1  1  2   3  4   5  6
    2  1  2   3  4   5  6
    3  1  2   3  4   5  6