python-polars

How can I execute a set of functions on all columns in a Polars dataframe?


I'm trying to execute multiple operations on all columns of a Polars dataframe using the pl.all function.

import pandas as pd
import polars as pl

df = pl.DataFrame(
    {"a": [1, 2, 3], "b": [4, 5, 6], "c": [1, 1, 1]}
)

For the time being, I am using this code:

def col_operations(column: pl.Series):
    col_stats = {
        "count_unique": column.n_unique(),
        "count_nans": column.null_count(),
        "nan_frac": column.null_count() / len(column),
    }

    return col_stats

def profile(df: pl.DataFrame):
    profiling_dict = {}
    for col in df.columns:
        profiling_dict[col] = col_operations(df[col])

    return pd.DataFrame(profiling_dict)  # NOTE: pandas

profile(df)

which returns what I want to see:

                a    b    c
count_unique  3.0  3.0  1.0
count_nans    0.0  0.0  0.0
nan_frac      0.0  0.0  0.0

However, for my actual use case I am working on tables that can have ~15 columns and potentially millions of rows, so I'm looking for something faster than what I came up with.

I tried to use pl.all as an argument to df.select, but I get an error:

df.select(pl.all().sum(), pl.all().mean())
# DuplicateError: the name 'a' is duplicate

Adding a suffix seems to work:

df.select(pl.all().sum().name.suffix("_sum"), pl.all().first().name.suffix("_first"))
shape: (1, 6)
┌───────┬───────┬───────┬─────────┬─────────┬─────────┐
│ a_sum ┆ b_sum ┆ c_sum ┆ a_first ┆ b_first ┆ c_first │
│ ---   ┆ ---   ┆ ---   ┆ ---     ┆ ---     ┆ ---     │
│ i64   ┆ i64   ┆ i64   ┆ i64     ┆ i64     ┆ i64     │
╞═══════╪═══════╪═══════╪═════════╪═════════╪═════════╡
│ 6     ┆ 15    ┆ 3     ┆ 1       ┆ 4       ┆ 1       │
└───────┴───────┴───────┴─────────┴─────────┴─────────┘

This is kind of what I want to see; however, I would rather have the format from the first example.

With Pandas, I can get this done with .apply:

def col_operations(column: pd.Series):
    col_stats = {
        "count_unique": column.nunique(),
        "count_nans": column.isna().sum(),
        "nan_frac": column.isna().sum() / len(column),
    }

    return pd.Series(col_stats)

df.to_pandas().apply(col_operations, axis=0)
                a    b    c
count_unique  3.0  3.0  1.0
count_nans    0.0  0.0  0.0
nan_frac      0.0  0.0  0.0

Is this doable in some way using Polars?


Solution

  • I think you really want to expressionize your col_operations and profile functions, i.e. turn them into Polars expressions.

    Let's deconstruct col_operations.

    "count_unique": column.n_unique(),
    "count_nans": column.null_count(),
    "nan_frac": column.null_count()/len(column)  
    

    We can use the documented pl.api.register_expr_namespace mechanism to register our own expression namespace.

    Method 1 (do everything wide, then unpivot/split/pivot)

    @pl.api.register_expr_namespace("col_ops")
    class AnyNameHereIsFine:
        def __init__(self, expr):
            self._expr = expr
        def n_unique(self):
            return self._expr.n_unique().name.suffix("_____count_unique")
        def null_count(self):
            return self._expr.null_count().name.suffix("_____count_nans")
        def nan_frac(self):
            return (self._expr.null_count()/self._expr.count()).name.suffix("_____nan_frac")
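
    Once registered, the namespace is available on any expression, so a single stat can be called directly. A quick sanity check with the example frame from the question:

    # expands over every column: a_____count_unique, b_____count_unique, c_____count_unique
    df.select(pl.all().col_ops.n_unique())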
    

    Now we can use dir and getattr to run them all:

    df.select(getattr(pl.all().col_ops, x)() for x in dir(pl.Expr.col_ops) if x[:1] != "_")
    

    Of course, you don't want the wide format... who would?

    From here you've got to unpivot, split, and pivot:

    (
        df.select(
            getattr(pl.all().col_ops, stat_method)()
            for stat_method in dir(pl.Expr.col_ops)
            if stat_method[:1] != "_"
        )
        .unpivot()
        .with_columns(varsplit=pl.col("variable").str.split("_____"))
        .with_columns(
            c1=pl.col("varsplit").list.first(),
            c2=pl.col("varsplit").list.last(),
        )
        .pivot(on="c1", index="c2", values="value")
    )
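
    With the example frame from the question, this should come out roughly like the following (row order follows dir(), i.e. alphabetical by method name, and may vary by Polars version; the unpivot casts everything to the common Float64 supertype):

    shape: (3, 4)
    ┌──────────────┬─────┬─────┬─────┐
    │ c2           ┆ a   ┆ b   ┆ c   │
    │ ---          ┆ --- ┆ --- ┆ --- │
    │ str          ┆ f64 ┆ f64 ┆ f64 │
    ╞══════════════╪═════╪═════╪═════╡
    │ count_unique ┆ 3.0 ┆ 3.0 ┆ 1.0 │
    │ nan_frac     ┆ 0.0 ┆ 0.0 ┆ 0.0 │
    │ count_nans   ┆ 0.0 ┆ 0.0 ┆ 0.0 │
    └──────────────┴─────┴─────┴─────┘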
    

    Method 2 (struct/unnest/concat)

    @pl.api.register_expr_namespace("col_ops")
    class AnyNameHereIsFine:
        def __init__(self, expr):
            self._expr = expr
        def n_unique(self):
            return pl.struct([pl.lit("count_unique").alias("stat"), self._expr.n_unique().cast(pl.Float64())])
        def null_count(self):
            return pl.struct([pl.lit("count_nans").alias("stat"), self._expr.null_count().cast(pl.Float64())])
        def nan_frac(self):
            return pl.struct([pl.lit("nan_frac").alias("stat"), (self._expr.null_count()/self._expr.count()).cast(pl.Float64())])
    

    In this method, we create structs at the outset, which are essentially a means of having one expression return two things at once: a field labeling the stat and the stat values themselves. Then we compute the stats one row at a time and concat all those rows into a final answer like this:

    pl.concat(
        df.select(getattr(pl.all().col_ops, stat_method)()).unnest("stat")
        for stat_method in dir(pl.Expr.col_ops)
        if stat_method[:1] != "_"
    )
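
    This yields the desired three-row layout directly: a stat column plus one column per input column. The casts to Float64 inside the namespace methods are what let the rows from the different stats line up for the concat.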
    

    This method might benefit from using lazy frames, i.e. instead of the above, do:

    df = df.lazy()
    pl.concat(
        df.select(getattr(pl.all().col_ops, stat_method)()).unnest("stat")
        for stat_method in dir(pl.Expr.col_ops)
        if stat_method[:1] != "_"
    ).collect()
    

    The benefit of lazy frames comes from the parallel parameter of pl.concat; from its documentation:

    parallel

    Only relevant for LazyFrames. This determines if the concatenated lazy computations may be executed in parallel.

    In the first method, we get the benefit of parallelization by computing all the stats in the same expression, but then we have to do all the unpivoting and pivoting afterwards, which eats up some of those gains. In this method we never split strings, unpivot, or pivot, but we compute the stats one method at a time rather than all at once, which might carry a penalty. By doing it with lazy frames, concat can execute those computations in parallel, thereby giving that benefit back (maybe).
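
    If you want to check whether the lazy concat actually pays off on tables shaped like yours, here is a rough timing sketch. It assumes the Method 2 version of the col_ops namespace is the one in effect (both classes above register the same name, so the last registration wins); the frame big is just an illustration, and timings will vary by machine and Polars version.

    import time

    import numpy as np
    import polars as pl

    # ~15 columns x 1M rows, similar in shape to the real use case
    rng = np.random.default_rng(0)
    big = pl.DataFrame({f"col{i}": rng.random(1_000_000) for i in range(15)})

    methods = [m for m in dir(pl.Expr.col_ops) if m[:1] != "_"]

    # eager: each select runs on its own
    start = time.perf_counter()
    pl.concat(
        big.select(getattr(pl.all().col_ops, m)()).unnest("stat")
        for m in methods
    )
    print(f"eager: {time.perf_counter() - start:.3f}s")

    # lazy: pl.concat may execute the lazy selects in parallel
    lf = big.lazy()
    start = time.perf_counter()
    pl.concat(
        lf.select(getattr(pl.all().col_ops, m)()).unnest("stat")
        for m in methods
    ).collect()
    print(f"lazy:  {time.perf_counter() - start:.3f}s")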

    props to @ΩΠΟΚΕΚΡΥΜΜΕΝΟΣ for this answer