
dtype is ignored when using multilevel columns


When reading a CSV file with multi-level columns via pd.read_csv (using header= with a list of rows), pandas seems to ignore the dtype= keyword. Is there a way to make pandas use the passed types? I am reading large data sets from CSV, so I try to read the data in the correct types right away to save CPU and memory.

I tried passing a dict to dtype= with tuples as keys as well as with strings. It seems that dtype expects strings. At least I observed that the types are assigned if I pass the level 0 keys, but unfortunately that means all columns with the same level 0 label get the same type. In the example below, columns (A, int16) and (A, int32) would get type object, and (B, float32) and (B, int16) would get float32.

    import pandas as pd

    df = pd.DataFrame({
        ('A', 'int16'):   pd.Series([1, 2, 3, 4], dtype='int16'),
        ('A', 'int32'):   pd.Series([132, 232, 332, 432], dtype='int32'),
        ('B', 'float32'): pd.Series([1.01, 1.02, 1.03, 1.04], dtype='float32'),
        ('B', 'int16'):   pd.Series([21, 22, 23, 24], dtype='int16')})
    print(df)
    df.to_csv('test_df.csv')
    print(df.dtypes)

    # full column name tuples with level 0/1 labels don't work
    df_new = pd.read_csv(
        'test_df.csv',
        header=list(range(2)),
        dtype={
            ('A', 'int16'): 'int16',
            ('A', 'int32'): 'int32'
        })
    print(df_new.dtypes)

    # using the level 0 labels for dtype= seems to work
    df_new2 = pd.read_csv(
        'test_df.csv',
        header=list(range(2)),
        dtype={
            'A': 'object',
            'B': 'float32'
        })
    print(df_new2.dtypes)

I'd expect print(df_new.dtypes) to output the same column types as the first print(df.dtypes), but read_csv does not seem to use the dtype= argument at all and infers the types instead, resulting in much more memory-intensive types.
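To give a rough idea of why the inferred types matter, here is a minimal sketch that compares the memory taken by the same integer column stored as int16 and as int64. The column name x and the row count are made up for illustration and are not part of the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical column: 1,000 small integers that fit comfortably in int16
n = 1_000
narrow = pd.DataFrame({"x": np.arange(n, dtype="int16")})

# The same values stored the way read_csv infers integers by default (int64)
wide = narrow.astype("int64")

# Bytes used by the column data only (the index is excluded)
narrow_bytes = int(narrow.memory_usage(index=False).sum())
wide_bytes = int(wide.memory_usage(index=False).sum())

print(narrow_bytes, wide_bytes)  # 2000 8000
```

The int64 column takes four times the space of the int16 one, which adds up quickly on large files.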

Am I missing something?

Thank you in advance, Jottbe


Solution

  • This is a bug that is also present in the current version of pandas. I filed a bug report here.

    However, there is a workaround for the current version. The tuple-keyed dtype= mapping is applied if the engine is switched to python:

    df_new= pd.read_csv(
        'test_df.csv',
        header=list(range(2)),
        engine='python',
        dtype = {
            ('A', 'int16'): 'int16', 
            ('A', 'int32'): 'int32'
        })
    print(df_new.dtypes)
    

    The output is:

    Unnamed: 0_level_0  Unnamed: 0_level_1      int64
    A                   int16                   int16
                        int32                   int32
    B                   float32               float64
                        int16                   int64
    

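    So the "A" columns are typed as specified in dtype=. The "B" columns were not listed in the dict, so their types are still inferred (float64 and int64).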
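The same workaround can be extended to cover every column. The following is a minimal sketch, assuming the CSV was written without the index column (as with to_csv(index=False)) and is supplied here as an in-memory string; the names csv_text, full_dtypes and df_full are illustrative only. Behaviour may differ between pandas versions, so treat it as something to verify on your own installation.

```python
import io
import pandas as pd

# Hypothetical CSV content with two header rows and no index column
csv_text = (
    "A,A,B,B\n"
    "int16,int32,float32,int16\n"
    "1,132,1.01,21\n"
    "2,232,1.02,22\n"
    "3,332,1.03,23\n"
    "4,432,1.04,24\n"
)

# One entry per column, keyed by the full (level 0, level 1) tuple
full_dtypes = {
    ("A", "int16"): "int16",
    ("A", "int32"): "int32",
    ("B", "float32"): "float32",
    ("B", "int16"): "int16",
}

df_full = pd.read_csv(
    io.StringIO(csv_text),
    header=[0, 1],    # two header rows form the column MultiIndex
    engine="python",  # the workaround from the answer above
    dtype=full_dtypes,
)
print(df_full.dtypes)
```

Note that the python engine is generally slower than the default C engine, so on large files the saving in memory comes at the cost of parsing speed.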