
Polars vs. Pandas: size and speed difference


I have a parquet file (~1.5 GB) which I want to process with polars. The resulting dataframe has 250k rows and 10 columns. One column has large chunks of text in it.

I have just started using polars, because I heard many good things about it, one of which is that it is significantly faster than pandas.

Here is my issue / question:
The preprocessing of the dataframe is rather slow, so I started comparing it to pandas. Am I doing something wrong, or is polars just slower for this particular use case? If so, is there a way to speed this up?

Here is my code in polars:

import polars as pl

df = (pl.scan_parquet("folder/myfile.parquet")
      .filter((pl.col("type")=="Urteil") | (pl.col("type")=="Beschluss"))
      .collect()
     )
df.head()

The entire code takes roughly 1 minute whereas just the filtering part takes around 13 seconds.
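
I measured the filter separately from the read, roughly like this sketch (it assumes the whole file fits in memory so the read is paid once up front):

import time

import polars as pl

# read eagerly once so the timing below covers only the filter
df_all = pl.read_parquet("folder/myfile.parquet")

t0 = time.perf_counter()
df = df_all.filter((pl.col("type") == "Urteil") | (pl.col("type") == "Beschluss"))
print(f"filter only: {time.perf_counter() - t0:.1f}s")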

My code in pandas:

import pandas as pd 

df = (pd.read_parquet("folder/myfile.parquet")
      .query("type == 'Urteil' | type == 'Beschluss'")
     )
df.head()

The entire code also takes roughly 1 minute whereas just the querying part takes <1 second.

The dataframe has the following types for the 10 columns:

  • i64
  • str
  • struct[7]
  • str (for all remaining)

As mentioned: a column "content" stores large texts (1 to 20 pages of text) which I need to preprocess and then store differently, I guess.

EDIT: removed the size part of the original post as the comparison was not like for like and does not appear to be related to my question.


Solution

Edit 30 January 2025: The answer below is no longer accurate; Polars has since switched to a faster implementation of its string type.

    As mentioned: a column "content" stores large texts (1 to 20 pages of text) which I need to preprocess and then store differently, I guess.

This is where polars must do much more work than pandas. Polars uses the Arrow memory format for string data, so when you filter your DataFrame, all the columns are recreated for the rows where the mask evaluates to true.

That means all the text bytes in the string columns need to be moved around, whereas pandas can simply move the pointers to the Python objects, i.e. a few bytes per value.
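
To see the effect in isolation, here is a small synthetic sketch (the row count and text size are invented, and on Polars versions with the newer string type mentioned in the edit above, the gap largely disappears):

import time

import pandas as pd
import polars as pl

# synthetic stand-in: long strings that a (pre-2025) polars filter must
# physically copy, while pandas only moves pointers to Python objects
n = 100_000
texts = ["lorem ipsum " * 200] * n  # ~2.4 KB of text per row
types = ["Urteil" if i % 3 == 0 else "Other" for i in range(n)]

pl_df = pl.DataFrame({"type": types, "content": texts})
pd_df = pd.DataFrame({"type": types, "content": texts})

t0 = time.perf_counter()
pl_df.filter(pl.col("type").is_in(["Urteil", "Beschluss"]))
print("polars filter:", time.perf_counter() - t0)

t0 = time.perf_counter()
pd_df.query("type == 'Urteil' | type == 'Beschluss'")
print("pandas filter:", time.perf_counter() - t0)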

This only hurts if you have really large string values, e.g. when you are storing whole webpages. You can speed this up by converting to categoricals, as sketched below.
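
A minimal sketch of that conversion, assuming the large text column is called "content" as in the question:

import polars as pl

df = (pl.scan_parquet("folder/myfile.parquet")
      # cast the big text column to Categorical: each distinct string is
      # stored once in the dictionary, so the filter only moves the small
      # integer codes instead of the full text bytes
      .with_columns(pl.col("content").cast(pl.Categorical))
      .filter((pl.col("type") == "Urteil") | (pl.col("type") == "Beschluss"))
      .collect()
     )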