Search code examples
python-3.xpandasmulti-index

How to format the csv file with df.to_csv for a multiindex dataframe, python3


I have a multi-indexed dataframe,

>>> df
       a1      a2    
       b1  b2  b1  b2
c1 d1  11  21  31  41
   d2  12  22  32  42
c2 d1  13  23  33  43
   d2  14  24  34  44

It has 2 levels of header and 2 levels of index. If I directly use the code df.to_csv('test_file.csv'), then the format of the file test_file.csv is

,,a1,a1,a2,a2
,,b1,b2,b1,b2
c1,d1,11,21,31,41
c1,d2,12,22,32,42
c2,d1,13,23,33,43
c2,d2,14,24,34,44

However, I would like to change it to

  1. remove the duplicates in the 1st level of header
  2. remove entire 1st level of index, and make an empty row for each one in the 1st level of index.

The wanted format is:

,a1,,a2,
,b1,b2,b1,b2
c1,,,,,
d1,11,21,31,41
d2,12,22,32,42
c2,,,,,
d1,13,23,33,43
d2,14,24,34,44

Could you please show me how to do it? Thanks! Please use the code below.

import pandas as pd


df = pd.DataFrame(
    {
        ('a1', 'b1'): [11, 12, 13, 14],
        ('a1', 'b2'): [21, 22, 23, 24],
        ('a2', 'b1'): [31, 32, 33, 34],
        ('a2', 'b2'): [41, 42, 43, 44],
    },
    index=pd.MultiIndex.from_tuples([
        ('c1', 'd1'),
        ('c1', 'd2'),
        ('c2', 'd1'),
        ('c2', 'd2'),
    ]),
)

print(df)
df.to_csv('my_test_file.csv')

Solution

  • Here is a working solution. It uses a helper function to remove the duplicated consecutive labels and groupy+apply+pandas.concat to shift the multiindex level as extra empty row:

    def remove_consecutive(l):
        '''replaces consecutive items in "l" with empty string'''
        from itertools import groupby, chain
        return tuple(chain(*([k]+['']*(len(list(g))-1) for k,g in groupby(l))))
    
    (df.groupby(level=0)
       # below to shift level=0 as new row
       .apply(lambda g: pd.concat([pd.DataFrame([],
                                                index=[g.name],
                                                columns=g.columns),
                                   g.droplevel(0)]))
      .droplevel(0)
      # below to remove the duplicate column names
      .T # unfortunately there is no set_index equivalent for columns, so transpose before/after
      .set_index(pd.MultiIndex.from_arrays(list(map(remove_consecutive, zip(*df.columns)))))
      .T
      # export
      .to_csv('filename.csv')
    )
    

    output:

    ,a1,,a2,
    ,b1,b2,b1,b2
    c1,,,,
    d1,11,21,31,41
    d2,12,22,32,42
    c1,,,,
    d1,13,23,33,43
    d2,14,24,34,44