Tags: python, memory, python-polars, lazyframe

How to check if a LazyFrame is empty?


Polars DataFrames have an is_empty method:

import polars as pl

df = pl.DataFrame()
df.is_empty()  # True

df = pl.DataFrame({"a": [], "b": [], "c": []})
df.is_empty()  # True

Polars LazyFrames have no such method, so I devised the following helper function:

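def is_empty(data: pl.LazyFrame) -> bool:
    return (
        data.width == 0  # No columns
        # Intended check: columns exist, but are empty. Note that null_count()
        # counts nulls, not rows, so this is also True for any frame without nulls.
        or data.null_count().collect().sum_horizontal()[0] == 0
    )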


other = pl.LazyFrame()
other.pipe(is_empty)  # True

other = pl.LazyFrame({"a": [], "b": [], "c": []})
other.pipe(is_empty)  # True

Is there a better way to do this? By better, I mean either without collecting, or less memory-intensive if collecting cannot be avoided.


Solution

  • As explained in the comments, "A LazyFrame doesn't have length. It is a promise on future computations. If we would do those computations implicitly, we would trigger a lot of work silently. IMO when the length is needed, you should materialize into a DataFrame and cache that DataFrame so that that work isn't done twice".

    So, calling collect is inevitable, but one can limit the cost by collecting only the first row (if any) with Polars limit, as suggested by @Timeless:

    import polars as pl
    
    df = pl.LazyFrame()
    df.limit(1).collect().is_empty()  # True
    
    df = pl.LazyFrame({"a": [], "b": [], "c": []})
    df.limit(1).collect().is_empty()  # True
    
    df = pl.LazyFrame({col: range(100_000_000) for col in ("a", "b", "c")})
    df.limit(1).collect().is_empty()  # False; the check itself collects at most one row