
What are the advantages of a polars LazyFrame over a Dataframe?


Python Polars is pretty similar to pandas.

I know that in pandas we do not have LazyFrames.

In Polars, we can create LazyFrames just like DataFrames:

import polars as pl
data = {"a": [1, 2, 3], "b": [5, 4, 8]}
lf = pl.LazyFrame(data)

I want to know what the advantages of a LazyFrame over a DataFrame are.

If someone can explain with examples, thanks.


Solution

  • I think this is explained very well in the Polars docs:

    With the lazy API Polars doesn't run each query line-by-line but instead processes the full query end-to-end. To get the most out of Polars it is important that you use the lazy API because:

    - the lazy API allows Polars to apply automatic query optimization with the query optimizer
    - the lazy API allows you to work with larger-than-memory datasets using streaming
    - the lazy API can catch schema errors before processing the data
    

    Here we see how to use the lazy API starting from either a file or an existing DataFrame.

    So, in short, in both cases you write your transformations. With a normal DataFrame these transformations are executed one by one; in the lazy case, a "query optimizer" will look for shortcuts in the algorithm that reach the same result.

    Notice that these shortcuts are not guaranteed and not always present, so in the worst case the lazy operation will simply perform like the traditional one.

    Example

    As an example, imagine that you need to read a CSV, transform it, and filter it. In pandas we would:

    import pandas as pd

    # Read everything, even though we don't need every column
    df = pd.read_csv("example.csv")
    # Potentially reallocates memory or duplicates memory usage
    df = df[['name', 'col2']]
    # We transform every row, even the ones we will filter out later
    df['name'] = df['name'].str.upper()
    # Same as before; could have been avoided by not reading those rows in the first place, or by filtering earlier
    df = df[df['col2'] > 0]
    

    In this process we may have allocated much more memory than required for the original DataFrame, and in general we could have read the CSV file just once, looking only for the 'name' and 'col2' columns and the rows where 'col2' > 0.

    There are optimizations we can apply in the code ourselves, such as filtering first and transforming later. But let's check what a lazy operation can do.
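    For reference, that hand-optimized pandas rewrite might look like the following (the CSV contents here are made up so the snippet runs on its own, standing in for "example.csv"):

```python
import io

import pandas as pd

# Stand-in for "example.csv" so the snippet is self-contained
csv_file = io.StringIO("name,col2,extra\nalice,1,x\nbob,-2,y\n")

# Read only the columns we actually need
df = pd.read_csv(csv_file, usecols=["name", "col2"])
# Filter first, so we only transform the surviving rows
df = df[df["col2"] > 0]
df["name"] = df["name"].str.upper()
```

    This is essentially what the Polars query optimizer does for us automatically.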

    Again from the Polars docs, we could have written the whole process as a single "query":

    import polars as pl

    q1 = (
        pl.scan_csv("docs/src/data/reddit.csv")
        .with_columns(pl.col("name").str.to_uppercase())
        .filter(pl.col("comment_karma") > 0)
    )
    
    # which reads in Polars as:
    FILTER [(col("comment_karma")) > (0)] FROM WITH_COLUMNS:
        [col("name").str.uppercase()]
    
        CSV SCAN data/reddit.csv
        PROJECT */6 COLUMNS
    

    Then the Polars optimizer will transform it (unless you tell it not to), so the actual operation looks like:

     WITH_COLUMNS:
     [col("name").str.uppercase()]
    
        CSV SCAN data/reddit.csv
        PROJECT */6 COLUMNS
        SELECTION: [(col("comment_karma")) > (0)]
    

    This is what Polars executes: it basically filters and transforms the DataFrame at read time, all at once, saving computation and time.

    tl;dr:

    The LazyFrame is meant to let you use the lazy API, whose query optimization reduces the amount of computation needed by intelligently reordering operations instead of blindly executing them in order, as is usual in pandas.

    Of course, to use the lazy API we need to defer the actual execution of the code until it is needed, so that Polars can rearrange all the operations.