I'm running a Python script that analyses a DataFrame (loaded from a Parquet file).
I ran my program with the memory_profiler package to see its memory footprint. On two DataFrames of 9 MB and 20 MB (sizes obtained via df.estimated_size()), I'm seeing a whopping 280 MB of memory usage (I also ran it in Docker and checked the container statistics to make sure I'm in a clean environment).
So I created a simple script that just loads the data into a DataFrame, and I'm still seeing the 20 MB DataFrame increase memory usage by 70 MB, and the 9 MB one by 30 MB.
These are the results of estimated_size and memory_profiler for that script:
f 20.546353340148926 # estimated size, mb
g 9.469635009765625 # estimated size, mb
7 32.8 MiB 32.8 MiB 1 @profile
8 def run():
9
10 105.1 MiB 72.2 MiB 1 pldf_f = pl.read_parquet('./benchmark/data/f_mock_data_100-records.parquet')
11 136.1 MiB 31.0 MiB 1 pldf_g = pl.read_parquet('./benchmark/data/g_mock_data_100-records.parquet')
12
13 136.2 MiB 0.1 MiB 1 print(f'f {pldf_f.estimated_size("mb")}')
14 136.2 MiB 0.0 MiB 1 print(f'g {pldf_g.estimated_size("mb")}')
I thought Polars was supposed to have a low memory impact, yet I'm seeing roughly 3x the DataFrame size.
This is of course just mock test data; I expect to analyze records in the hundreds of thousands, if not millions.
Am I reading the results wrong?
The memory shown for a process is not just the aggregate of the memory held in its data structures. During heap allocation and deallocation, the allocator will keep hold of freed memory pages and not return them to the OS.
This is desired behavior: the allocator can reuse that memory later, which improves performance.
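You can observe this effect outside Polars too. A minimal sketch (Linux-only, since it reads the resident set size from /proc; the exact numbers are allocator- and platform-dependent):

```python
def rss_mb() -> float:
    """Resident set size of this process in MB (Linux: parsed from /proc)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024.0  # reported in kB
    raise RuntimeError("VmRSS not found")

before = rss_mb()
buf = [bytes(1024) for _ in range(100_000)]  # allocate roughly 100 MB
during = rss_mb()
del buf  # Python frees the objects, but the allocator may keep the pages
after = rss_mb()
print(f"before={before:.0f} during={during:.0f} after={after:.0f} (MB)")
```

Depending on the allocator, `after` often stays well above `before`: the freed pages are cached for reuse rather than handed back to the OS, which is exactly what a profiler watching process RSS will report.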
3x a table size is actually pretty good. Consider that you have a memory buffer and need to deserialize that data into another buffer; that alone already gives you 2x.
Used RAM is not so bad; it is there to be used. We could optimize for minimal RAM usage, but that would hurt performance and waste available resources.
If your data doesn't fit into RAM, try the Polars streaming API. It processes the data in batches, spills to disk when needed, and allows you to materialize on disk rather than in RAM.
You can shrink the DataFrame size by calling df.shrink_to_fit(), but I would advise against it unless it is needed.