I need all the rows that have null in one of the predefined columns. I basically need this but i have one more requirement that I cant seem to figure out. Not every column needs to be checked.
I have a function that returns the names of the columns that need to be checked in a list.
Assume this is my dataframe:
data = pl.from_repr("""
┌───────┬───────┬─────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═══════╪═══════╪═════╪═══════╡
│ abc ┆ null ┆ u ┆ true │
│ def ┆ abc ┆ v ┆ true │
│ ghi ┆ def ┆ null┆ true │
│ jkl ┆ uvw ┆ x ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└───────┴───────┴─────┴───────┘
""")
Doing
data.filter(polars.any_horizontal(polars.all().is_null()))
gives me all rows where any of the columns contain null
.
Sometimes it's fine for column c
to contain a null
so let's not check it.
what I want is this:
┌───────┬───────┬─────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═══════╪═══════╪═════╪═══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└───────┴───────┴─────┴───────┘
Row 3 is not shown even though there is a null value in column c.
columns = ["a", "b", "d"]
data.filter(polars.any_horizontal(polars.all(*columns).is_null()))
This gives me
polars.exceptions.SchemaError: invalid series dtype: expected 'Boolean', got 'str'
I thought maybe the columns aren't aligned or somethig because data has more columns than what the filter uses, so i did this.
columns = ["a", "b", "d"]
# notice `.select(columns)` here
data.select(columns).filter(polars.any_horizontal(polars.all(*columns).is_null()))
But is still get the same error.
How do I get the full rows of data that contain a null
in one of ["a", "b", "d"]
columns
If you want exclude some columns you can use .exlude()
:
import polars as pl
data.filter(pl.any_horizontal(pl.exclude("c").is_null()))
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘
Or you can just use column names by using .col()
:
import polars as pl
cols = ["a","b","d"]
data.filter(pl.any_horizontal(pl.col(cols).is_null()))
shape: (3, 4)
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘
If you want to be really flexible, you can use selectors
, for example .selectors.exclude()
:
import polars.selectors as cs
data.filter(pl.any_horizontal(cs.exclude("c").is_null()))
shape: (3, 4)
┌─────┬──────┬─────┬──────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ bool │
╞═════╪══════╪═════╪══════╡
│ abc ┆ null ┆ u ┆ true │
│ mno ┆ xyz ┆ y ┆ null │
│ qrs ┆ null ┆ z ┆ null │
└─────┴──────┴─────┴──────┘