Search code examples
pythonpandascsvmulti-index

Python: parse unnamed columns from csv


I would like to correctly read Unnamed columns from previously generated csv file.

The csv has a multiindex header with different shape for each column, ie, say, column 0 has one coordinate, column 1 has two and column 2 has three. Thus, when reading the full csv, Unnamed columns appear in column 0 and column 1. I need to locate a row with a given value but I cannot pass the correct column.

Imagine 'my.csv' file looks like this:

C0,C1,C2
,C10,C20
,,C22
-2,0,0.4101
-1,1,0.8058
0,0,0.1000
1,3,0.3846

I try to read it:

read_df = pd.read_csv('my.csv', header=[0,1,2]) 
print(read_df)

                  C0                 C1      C2
  Unnamed: 0_level_1                C10     C20
  Unnamed: 0_level_2 Unnamed: 1_level_2     C22
0                 -2                  0  0.4101
1                 -1                  1  0.8058
2                  0                  0  0.1000
3                  1                  3  0.3846

I need to select the rows corresponding to a value of C0, but I end up either in KeyError or in TypeError, such as in the following:

read_df.loc( read_df[('C0','','')] == 0 )  ## KeyError: ('C0', '', '')

read_df.loc( read_df[('C0','0_level_1','0_level_2')] == 0 ) ## KeyError: ('C0', '0_level_1', '0_level_2')

read_df.loc( read_df[('C0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2')] == 0 ) ## TypeError: 'Series' objects are mutable, thus they cannot be hashed

One turnaround could be to name the columns before hand and prevent this to happen, but it would be nice to understand how to manage this.

Thanks

PS here's how I generate 'my.csv'

import pandas as pd
import random as rnd

col_0 = [('C0', '', '')]
col_1 = [('C1', 'C10','')]
col_2 = [('C2', 'C20', 'C22')]

tot_col = columns=pd.MultiIndex.from_tuples(col_0 + col_1 + col_2)

tot_df = pd.DataFrame(columns=tot_col)

def get_data():
    data_dict = { ('C1','C10','')    : rnd.randint(0,5),
                  ('C2','C20','C22') : '{:2.4f}'.format(rnd.random()) }
    data_df = pd.DataFrame( [data_dict], columns=tot_col )
    return(data_df)

for ii in range(-2, 2):
    ii_df = get_data()
    ii_df[('C0','','')] = ii
    tot_df = pd.concat([tot_df, ii_df], ignore_index=True)

tot_df.to_csv('my.csv', index=False)

Solution

  • IIUC you're looking for

    read_df.loc[ read_df[('C0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2')] == 0 ]
    

    or simply

    read_df[read_df[('C0', 'Unnamed: 0_level_1', 'Unnamed: 0_level_2')] == 0]
    

    Result:

                      C0                 C1      C2
      Unnamed: 0_level_1                C10     C20
      Unnamed: 0_level_2 Unnamed: 1_level_2     C22
    2                  0                  4  0.2373
    




    Restoring the original index is a bit clumsy:

    read_df.columns = pd.MultiIndex.from_tuples([tuple('' if y.startswith('Unnamed:') else y for y in x) for x in map(list, read_df.columns.tolist())])
    

    After this you can use as usual:

    read_df[read_df[('C0','','')] == 0]
    

    Result:

      C0  C1      C2
         C10     C20
                 C22
    2  0   4  0.2373