python python-3.x pandas hdf5 pytables

DataFrame performance warning


I am getting a performance warning from pandas:

/usr/local/lib/python3.4/dist-packages/pandas/core/generic.py:1471: 
PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed-integer,key->block0_values] [items->['int', 'str']]

I've read several issues on GitHub and questions here, and all of them say this happens because types are mixed within one column, but I am definitely not doing that. A simple example follows:

import pandas as pd
df = pd.DataFrame(columns=['int', 'str'])
df = df.append({ 'int': 0, 'str': '0'}, ignore_index=True)
df = df.append({ 'int': 1, 'str': '1'}, ignore_index=True)
for _, row in df.iterrows():
   print(type(row['int']), type(row['str']))

# <class 'int'> <class 'str'>
# <class 'int'> <class 'str'>

# however
df.dtypes
# int    object
# str    object
# dtype: object

# the following causes the warning
df.to_hdf('table.h5', 'table')

What can this be about and what can I do?


Solution

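  • Both columns have object dtype because the DataFrame was created empty and each append keeps that dtype, so the integers are stored as generic Python objects. You need to convert your DataFrame series to numeric types, where appropriate.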

    There are 2 main ways to achieve this for integers:

    # Method 1
    df['col'] = df['col'].astype(int)
    
    # Method 2 (downcast='integer' picks the smallest integer dtype that fits the values)
    df['col'] = pd.to_numeric(df['col'], downcast='integer')
    

    This ensures data types are appropriately mapped to C-types, and thus enables data to be stored in HDF5 format (which PyTables uses) without the need for pickling.
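Applied to the example from the question, the conversion might look like the sketch below. It builds the frame directly with dtype=object instead of using df.append (which is no longer available in recent pandas releases), so the construction is an assumption made for illustration rather than the asker's exact code. The infer_objects() call at the end is an additional option not mentioned in the answer above.

```python
import pandas as pd

# Mimic the question's frame: both columns start out with object dtype.
# (Built directly here; the original used df.append on an empty frame.)
df = pd.DataFrame({'int': [0, 1], 'str': ['0', '1']}, dtype=object)
dtype_before = df['int'].dtype  # object

# Method 1 from the answer: cast the integer column explicitly.
df['int'] = df['int'].astype(int)
dtype_after = df['int'].dtype  # a platform integer dtype, e.g. int64

# Alternative (not part of the answer above): let pandas infer better
# dtypes for every object column in one call.
df_inferred = pd.DataFrame(
    {'int': [0, 1], 'str': ['0', '1']}, dtype=object
).infer_objects()

print(dtype_before, '->', dtype_after)
print(df_inferred.dtypes)

# With the integer column converted, writing to HDF5 should no longer
# need to pickle it (requires PyTables to be installed):
# df.to_hdf('table.h5', key='table')
```

The 'str' column stays as it is: it holds only strings, so nothing in it is mixed with integers once the 'int' column has its own numeric dtype.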