python-polars

Filtering selected columns based on column aggregate


I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.

Currently, I am solving it via python. Is there a more idiomatic way?

import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})

# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique

Equivalent in pandas:

df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]

Solution

  • The selected answer, though syntactically clean, is inefficient. You can do a lot better

    Let us first extend the problem to at least two filters rather than just one

    Problem: Select only those columns where the number of unique values is between 1 and 200

    The thing to consider is that you need a full pass over the data no matter what, so reading it in is the first step

    Then, if you do

    pl.select(
        [s for s in df
         if s.n_unique() < 200 and s.n_unique() > 1]
    )
    

    You are computing the filters in sequence and also keeping the intermediate Series in memory. htop confirms that this uses just one core of the machine. The ideal solution is to do it all in parallel.

    Let us run a few benchmarks. I am using a 32-core machine; parallelism would reduce the time further on machines with more cores

    Set up the DataFrame:

    import polars as pl
    import numpy as np
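    # 100 columns, each holding 10 million random strings drawn from four categories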
    df = pl.DataFrame({f'a_{i}':np.random.choice(['a','b','c','d'], 10000000) for i in range(100)})
    

    This takes up about 20 GiB of RAM, so be careful if you want to replicate it

    The selected solution (htop confirms that it uses only one core):

    %%timeit
    _df = pl.select(
        [s for s in df
         if s.n_unique() < 200 and s.n_unique() > 1]
    )
    output:
    18.7 s ± 92.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    Let us now try to run the filters in parallel (htop confirms that all cores are in use)

    %%timeit
    _df = df.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1))
    output:
    1.35 s ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    

    We are still computing n_unique() twice per column via the two .n_unique() calls above. Let us do it with just one call by using is_between (parallel execution; htop confirms all cores in use)

    %%timeit
    _df = df.select((pl.all().n_unique().is_between(1,200)))
    output:
    708 ms ± 21.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
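
    Note that the selects above return a one-row boolean frame (one value per column), not the filtered DataFrame itself. Here is a minimal sketch of mapping that mask back onto the original question (columns with fewer than 3 unique values), using the small df from the question; it mirrors the asker's approach but stays in polars via DataFrame.row instead of round-tripping through numpy

    # compute the per-column boolean mask in one parallel select,
    # then pull it out as a tuple of booleans with .row(0)
    mask = df.select(pl.all().n_unique() < 3).row(0)
    # keep only the columns whose mask entry is True
    df_few_unique = df.select([col for col, keep in zip(df.columns, mask) if keep])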
    

    Btw, if you don't want to remember APIs like is_between but still want to avoid computing n_unique() twice, you can use the lazy semantics and let the optimizer remove the duplicated work

    df_lazy = df.lazy()
    

    Now, try the two-comparison solution from above on the lazy frame:

    %%timeit
    _df = df_lazy.select((pl.all().n_unique() < 200) & (pl.all().n_unique() > 1)).collect()
    output:
    718 ms ± 15.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
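
    If you want to check what the optimizer actually does, you can print the query plan. This is a sketch assuming your polars version has LazyFrame.explain (older releases exposed describe_optimized_plan instead):

    # print the optimized query plan: if common subexpression elimination
    # kicks in, the duplicated n_unique() is only computed once per column
    plan = df_lazy.select(
        (pl.all().n_unique() < 200) & (pl.all().n_unique() > 1)
    ).explain()
    print(plan)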