Tags: pandas, pyarrow

Why does a pyarrow backend df need more RAM than a numpy backend?


I am reading a large parquet file with int, string and date columns. When using dtype_backend="pyarrow" instead of dtype_backend="numpy_nullable", I get 15.6 GB instead of 14.6 GB according to df.info(). I have also seen an even larger relative overhead with pyarrow on other datasets.

Code:

pd.read_parquet("df.parquet", dtype_backend="numpy_nullable").info()

dtypes: Int16(1), Int32(2), datetime64[ns, UTC](1), string(1), timedelta64[ns](1)
memory usage: 14.6 GB

pd.read_parquet("df.parquet", dtype_backend="pyarrow").info()

dtypes: duration[ns][pyarrow](1), int16[pyarrow](1), int32[pyarrow](2), string[pyarrow](1), timestamp[ns, tz=UTC][pyarrow](1)
memory usage: 15.6 GB

Is this the expected behaviour or do I have to tweak other parameters as well?

I'm using pandas[parquet] ~= 2.1.3


Solution

  • I believe pd.DataFrame.info gives you the shallow representation of memory when using numpy as a backend: for string columns it only counts the 8-byte object pointers, not the strings themselves, so it won't give you an accurate picture of their memory usage.

    On the other hand, the memory usage reported for the pyarrow backend is accurate, because Arrow string columns keep their data in buffers whose sizes are known exactly.

    You should use memory_usage(deep=True) instead:

    import pandas as pd

    df = pd.DataFrame({"col1": ["abc", "efg"]})

    # Shallow vs. deep memory usage for a numpy-backed (object) string column
    # and for the same column converted to the pyarrow-backed string dtype.
    (
        df.memory_usage().sum(),                                               # numpy, shallow
        df.memory_usage(deep=True).sum(),                                      # numpy, deep
        df.astype({"col1": "string[pyarrow]"}).memory_usage().sum(),           # pyarrow, shallow
        df.astype({"col1": "string[pyarrow]"}).memory_usage(deep=True).sum(),  # pyarrow, deep
    )
    
    

    Gives me: 144, 248, 142, 142. That is, the numpy-backed frame jumps from 144 to 248 bytes once the string contents are counted, while the pyarrow-backed frame reports 142 bytes either way.
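
    Applied back to the question, a fairer comparison of the two backends is to measure both with deep=True. A minimal sketch, assuming the same df.parquet file from the question (the loop and printed labels are just for illustration):

    import pandas as pd

    # Read the file with each backend and compare deep memory usage,
    # so that string contents are counted for the numpy-backed column too.
    for backend in ("numpy_nullable", "pyarrow"):
        df = pd.read_parquet("df.parquet", dtype_backend=backend)
        mem_gb = df.memory_usage(deep=True).sum() / 1e9
        print(f"{backend}: {mem_gb:.1f} GB")

    df.info(memory_usage="deep") reports the same deep figure directly.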