Search code examples
pythonpandasmulti-index

pandas multi-index column header change type when reading a CSV file


I have a pandas data frame that uses multi-index for both column and rows. Here is a simplified example:

import pandas as pd
import datetime

col1 = [datetime.date(2018, 1, 1)+i*datetime.timedelta(days=1) for i in range(4) for j in range(2)]
col2 = [     'lunch',     'dinner',      'lunch',     'dinner',
             'lunch',     'dinner',      'lunch',     'dinner']

hdr1 = ['starter', 'starter', 'main', 'main', ' main', 'dessert', 'dessert']
hdr2 = [        0,         1,      0,      1,       2,         0,         1]

col_index = pd.MultiIndex.from_arrays([col1, col2], names=['date', 'meal'])
row_index = pd.MultiIndex.from_arrays([hdr1, hdr2], names=['dish', 'content'])

df = pd.DataFrame(index=col_index, columns=row_index)

If I try to print a specific cell, it works fine:

In [45]: df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
Out[45]: 
date        meal 
2018-01-02  lunch    NaN
Name: (starter, 0), dtype: object

However, if I save it to a CSV file and read it again:

# write to CSV
df.to_csv('test.csv')

# read from CSV
df = pd.read_csv('test.csv', index_col=[0, 1], header=[0, 1], parse_dates=True)

Here is the result of the same command:

In [47]: df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-47-d0a7081dd150> in <module>()
----> 1 df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
[...]

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: ('starter', 0)

The reason is that when reading the data frame from the CSV file, the second level of the column index is not an integer anymore, but a string:

In [55]: df.columns.get_level_values(1)
Out[55]: Index(['0', '1', '0', '1', '2', '0', '1'], dtype='object', name='content')

Any idea of how to force the second level index to be read as an integer instead of a string?


Solution

  • I fear the only option is to just bite the bullet and fix your headers; use pd.MultiIndex.set_levels.

    df.columns = df.columns.set_levels(
          df.columns.levels[1].astype(int), level=1
    )
    

    <!- ->

    df.loc[('2018-01-02', 'lunch'), ('starter', 0)]
    
    date        meal 
    2018-01-02  lunch    NaN
    Name: (starter, 0), dtype: object