I have a dataframe in which a number of the columns consist only of NaNs. Using Polars, I am trying to select only the columns that are not made up entirely of NaNs.
I tried a syntax similar to what I would use in Pandas, e.g.
df[df.columns[~df.isnull().all()]]
However, the syntax doesn't translate. I also know that you can use df.filter, but this only filters rows, not columns, based on the criteria applied within the filter expression.
This is essentially subsetting columns with a boolean mask.
First, let's create some sample data:
import polars as pl
import numpy as np
df = pl.DataFrame({
"a": [np.nan, np.nan, np.nan, np.nan],
"b": [3.0, 4.0, np.nan, 5.0],
"c": [np.nan, np.nan, np.nan, np.nan]
})
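Next, we need to check, for each column, whether it contains at least one value that is not NaN (a result of false means the column consists entirely of NaN values):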
df.select(pl.all().is_not_nan().any())
shape: (1, 3)
┌───────┬──────┬───────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞═══════╪══════╪═══════╡
│ false ┆ true ┆ false │
└───────┴──────┴───────┘
To get this one-row DataFrame as an iterable, we use the row method:
df.select(pl.all().is_not_nan().any()).row(0)
# (False, True, False)
We can now use this mask in bracket notation:
df[:, df.select(pl.all().is_not_nan().any()).row(0)]
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
│ 4.0 │
│ NaN │
│ 5.0 │
└─────┘
Since bracket notation is generally not recommended, we can also do this with select, iterating over the columns of the DataFrame and keeping only those with at least one non-NaN value:
df.select(col for col in df if col.is_not_nan().any())
shape: (4, 1)
┌─────┐
│ b │
│ --- │
│ f64 │
╞═════╡
│ 3.0 │
│ 4.0 │
│ NaN │
│ 5.0 │
└─────┘