Search code examples
pythonpython-polars

Polars Selecting all columns without NaNs


I have a dataframe where a number of the columns only consists of NaNs. I am trying to select only the columns in the dataframe where all the values are not equal to NaNs using Polars.

I have tried seeing if I could use a similar syntax to how I would proceed in Pandas e.g.

df[df.columns[~df.isnull().all()]]

However the syntax doesn't translate. I also know that you can use pl.filter but this only filters rows and not columns based on the criteria's applied within the filter expression.


Solution

  • So this is basically subsetting columns with a boolean mask.

    So first let's create some sample data:

    import polars as pl
    import numpy as np
    
    df = pl.DataFrame({
        "a": [np.nan, np.nan, np.nan, np.nan],
        "b": [3.0, 4.0, np.nan, 5.0], 
        "c": [np.nan, np.nan, np.nan, np.nan]
    })
    

    Next we have to get if a column consists completely of NaN Values

    df.select(pl.all().is_not_nan().any())
    
    shape: (1, 3)
    ┌───────┬──────┬───────┐
    │ a     ┆ b    ┆ c     │
    │ ---   ┆ ---  ┆ ---   │
    │ bool  ┆ bool ┆ bool  │
    ╞═══════╪══════╪═══════╡
    │ false ┆ true ┆ false │
    └───────┴──────┴───────┘
    

    To get this DataFrame as iterable we use the row function

    df.select(pl.all().is_not_nan().any()).row(0)
    # (False, True, False)
    

    This we can now use in the bracket notation

    df[:, df.select(pl.all().is_not_nan().any()).row(0)]
    
    shape: (4, 1)
    ┌─────┐
    │ b   │
    │ --- │
    │ f64 │
    ╞═════╡
    │ 3.0 │
    │ 4.0 │
    │ NaN │
    │ 5.0 │
    └─────┘
    

    Since in general bracket notation is not recommended we can do this also with select: (for looking more concise we use the compress function from itertools)

    df.select(col for col in df if col.is_not_nan().any())
    
    shape: (4, 1)
    ┌─────┐
    │ b   │
    │ --- │
    │ f64 │
    ╞═════╡
    │ 3.0 │
    │ 4.0 │
    │ NaN │
    │ 5.0 │
    └─────┘