Search code examples
pythonnullpython-polars

Return all rows that have at least one null in one of the columns using Polars


I need all the rows that have null in one of the predefined columns. I basically need this but i have one more requirement that I cant seem to figure out. Not every column needs to be checked.

I have a function that returns the names of the columns that need to be checked in a list.

Assume this is my dataframe:

data = pl.from_repr("""
┌───────┬───────┬─────┬───────┐
│ a     ┆ b     ┆ c   ┆ d     │
│ ---   ┆ ---   ┆ --- ┆ ---   │
│ str   ┆ str   ┆ str ┆ bool  │
╞═══════╪═══════╪═════╪═══════╡
│ abc   ┆ null  ┆ u   ┆ true  │
│ def   ┆ abc   ┆ v   ┆ true  │
│ ghi   ┆ def   ┆ null┆ true  │
│ jkl   ┆ uvw   ┆ x   ┆ true  │
│ mno   ┆ xyz   ┆ y   ┆ null  │
│ qrs   ┆ null  ┆ z   ┆ null  │
└───────┴───────┴─────┴───────┘
""")

Doing data.filter(polars.any_horizontal(polars.all().is_null())) gives me all rows where any of the columns contain null.

Sometimes it's fine for column c to contain a null so let's not check it.

what I want is this:

┌───────┬───────┬─────┬───────┐
│ a     ┆ b     ┆ c   ┆ d     │
│ ---   ┆ ---   ┆ --- ┆ ---   │
│ str   ┆ str   ┆ str ┆ bool  │
╞═══════╪═══════╪═════╪═══════╡
│ abc   ┆ null  ┆ u   ┆ true  │
│ mno   ┆ xyz   ┆ y   ┆ null  │
│ qrs   ┆ null  ┆ z   ┆ null  │
└───────┴───────┴─────┴───────┘

Row 3 is not shown even though there is a null value in column c.

columns = ["a", "b", "d"]
data.filter(polars.any_horizontal(polars.all(*columns).is_null()))

This gives me polars.exceptions.SchemaError: invalid series dtype: expected 'Boolean', got 'str'

I thought maybe the columns aren't aligned or somethig because data has more columns than what the filter uses, so i did this.

columns = ["a", "b", "d"]
# notice `.select(columns)` here
data.select(columns).filter(polars.any_horizontal(polars.all(*columns).is_null()))

But is still get the same error. How do I get the full rows of data that contain a null in one of ["a", "b", "d"] columns


Solution

  • If you want exclude some columns you can use .exlude():

    import polars as pl
    
    data.filter(pl.any_horizontal(pl.exclude("c").is_null()))
    
    ┌─────┬──────┬─────┬──────┐
    │ a   ┆ b    ┆ c   ┆ d    │
    │ --- ┆ ---  ┆ --- ┆ ---  │
    │ str ┆ str  ┆ str ┆ bool │
    ╞═════╪══════╪═════╪══════╡
    │ abc ┆ null ┆ u   ┆ true │
    │ mno ┆ xyz  ┆ y   ┆ null │
    │ qrs ┆ null ┆ z   ┆ null │
    └─────┴──────┴─────┴──────┘
    

    Or you can just use column names by using .col():

    import polars as pl
    
    cols = ["a","b","d"]
    
    data.filter(pl.any_horizontal(pl.col(cols).is_null()))
    
    shape: (3, 4)
    ┌─────┬──────┬─────┬──────┐
    │ a   ┆ b    ┆ c   ┆ d    │
    │ --- ┆ ---  ┆ --- ┆ ---  │
    │ str ┆ str  ┆ str ┆ bool │
    ╞═════╪══════╪═════╪══════╡
    │ abc ┆ null ┆ u   ┆ true │
    │ mno ┆ xyz  ┆ y   ┆ null │
    │ qrs ┆ null ┆ z   ┆ null │
    └─────┴──────┴─────┴──────┘
    

    If you want to be really flexible, you can use selectors, for example .selectors.exclude():

    import polars.selectors as cs
    
    data.filter(pl.any_horizontal(cs.exclude("c").is_null()))
    
    shape: (3, 4)
    ┌─────┬──────┬─────┬──────┐
    │ a   ┆ b    ┆ c   ┆ d    │
    │ --- ┆ ---  ┆ --- ┆ ---  │
    │ str ┆ str  ┆ str ┆ bool │
    ╞═════╪══════╪═════╪══════╡
    │ abc ┆ null ┆ u   ┆ true │
    │ mno ┆ xyz  ┆ y   ┆ null │
    │ qrs ┆ null ┆ z   ┆ null │
    └─────┴──────┴─────┴──────┘