When using pd.read_csv with multi-level columns (read with header=), pandas seems to ignore the dtype= keyword.
Is there a way to make pandas use the passed types?
I am reading large data sets from CSV, so I try to read the data directly into the correct types to save CPU time and memory.
I tried passing a dict to dtype with tuples as well as strings as keys. It seems that dtype expects strings. At least I observed that if I pass the level 0 labels as keys, the types are assigned, but unfortunately that means all columns sharing a level 0 label get the same type. In the example below, columns (A, int16) and (A, int32) would get type object, and (B, float32) and (B, int16) would get float32.
import pandas as pd

df = pd.DataFrame({
    ('A', 'int16'): pd.Series([1, 2, 3, 4], dtype='int16'),
    ('A', 'int32'): pd.Series([132, 232, 332, 432], dtype='int32'),
    ('B', 'float32'): pd.Series([1.01, 1.02, 1.03, 1.04], dtype='float32'),
    ('B', 'int16'): pd.Series([21, 22, 23, 24], dtype='int16'),
})
print(df)
df.to_csv('test_df.csv')
print(df.dtypes)
# full column name tuples with level 0/1 labels don't work
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32',
    })
print(df_new.dtypes)
# using the level 0 labels for dtype= seems to work
df_new2 = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    dtype={
        'A': 'object',
        'B': 'float32',
    })
print(df_new2.dtypes)
I'd expect print(df_new.dtypes) to output the same column types as the first print(df.dtypes), but pandas does not seem to use the dtype= argument at all and infers the types instead, resulting in much more memory-intensive types.
Was I missing something?
Thank you in advance Jottbe
This is a bug that is also present in the current version of pandas. I filed a bug report here.
There is a workaround that also works in the current version, though: the dtypes are applied correctly if the engine is switched to python:
df_new = pd.read_csv(
    'test_df.csv',
    header=list(range(2)),
    engine='python',
    dtype={
        ('A', 'int16'): 'int16',
        ('A', 'int32'): 'int32',
    })
print(df_new.dtypes)
The output is:
Unnamed: 0_level_0  Unnamed: 0_level_1      int64
A                   int16                   int16
                    int32                   int32
B                   float32               float64
                    int16                   int64
So the "A-columns" are typed as specified in dtype. The "B-columns" were not listed in the dict, so their types are still inferred.
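If switching the engine is not an option, a possible fallback is to let pandas infer the types and cast the columns afterwards, addressing each one by its full tuple. The following is only a minimal sketch of that idea, not part of the workaround above: the wanted mapping and the in-memory CSV are made up for illustration, and the cast happens after parsing, so it presumably does not avoid the larger inferred types while the file is being read.

```python
import io

import pandas as pd

# Rebuild the frame from the question and serialize it to an in-memory CSV
# string (to_csv returns a string when no path is given).
df = pd.DataFrame({
    ('A', 'int16'): pd.Series([1, 2, 3, 4], dtype='int16'),
    ('A', 'int32'): pd.Series([132, 232, 332, 432], dtype='int32'),
    ('B', 'float32'): pd.Series([1.01, 1.02, 1.03, 1.04], dtype='float32'),
    ('B', 'int16'): pd.Series([21, 22, 23, 24], dtype='int16'),
})
csv_text = df.to_csv()

# Hypothetical mapping (an assumption for this sketch):
# full column tuple -> desired dtype.
wanted = {
    ('A', 'int16'): 'int16',
    ('A', 'int32'): 'int32',
    ('B', 'float32'): 'float32',
    ('B', 'int16'): 'int16',
}

# Read with the default engine and let pandas infer the types ...
df_cast = pd.read_csv(io.StringIO(csv_text), header=[0, 1])

# ... then cast column by column, selecting each column by its full tuple.
for col, dtype in wanted.items():
    df_cast[col] = df_cast[col].astype(dtype)

print(df_cast.dtypes)
```

The column holding the written index (the "Unnamed" one) is not in the mapping and keeps its inferred type.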