Tags: python, dataframe, python-polars

Column- and row-wise logical operations on Polars DataFrame


In pandas, one can reduce a boolean DataFrame with the all and any methods, choosing the direction with the axis argument. Consider the following DataFrame:

import pandas as pd

data = dict(A=["a","b","?"], B=["d","?","f"])
pd_df = pd.DataFrame(data)

For example, to get a boolean mask on columns containing the element "?":

(pd_df == "?").any(axis=0)

and to get a mask on rows:

(pd_df == "?").any(axis=1)

Also, to get a single boolean:

(pd_df == "?").any().any()

In comparison, in Polars, the best I could come up with is the following:

import polars as pl
pl_df = pl.DataFrame(data)

To get a mask on columns:

(pl_df == "?").select(pl.all().any())

To get a mask on rows:

pl_df.select(
    pl.concat_list(pl.all() == "?").alias("mask")
).select(
    pl.col("mask").list.eval(pl.element().any()).list.first()
)

And to get a single boolean value:

pl_df.select(
    pl.concat_list(pl.all() == "?").alias("mask")
).select(
    pl.col("mask").list.eval(pl.element().any()).list.first()
)["mask"].any()

The last two cases seem particularly verbose and convoluted for such a straightforward task, so I'm wondering whether there are shorter/faster equivalents?


Solution

  • I think one thing that makes this more confusing than it needs to be is that you're not doing everything in the select context. In other words, instead of comparing the whole DataFrame with (pl_df == "?"), build the comparison as an expression inside select.

    The first thing we can do is:

    pl_df.select(pl.all()=="?")
    shape: (3, 2)
    ┌───────┬───────┐
    │ A     ┆ B     │
    │ ---   ┆ ---   │
    │ bool  ┆ bool  │
    ╞═══════╪═══════╡
    │ false ┆ false │
    │ false ┆ true  │
    │ true  ┆ false │
    └───────┴───────┘
    

    pl.all() selects all of the columns. For each column, we convert every value into a bool indicating whether or not it is equal to "?".

    Now let's do this:

    pl_df.select((pl.all()=="?").any())
    
    shape: (1, 2)
    ┌──────┬──────┐
    │ A    ┆ B    │
    │ ---  ┆ ---  │
    │ bool ┆ bool │
    ╞══════╪══════╡
    │ true ┆ true │
    └──────┴──────┘
    

    This gives you the per-column result. All we did was add .any(), which reduces each column to a single value: true if anything in the parenthesized expression that precedes it is true.

    Now let's do this:

    pl_df.select(pl.any_horizontal(pl.all()=="?"))
    
    shape: (3, 1)
    ┌───────┐
    │ any   │
    │ ---   │
    │ bool  │
    ╞═══════╡
    │ false │
    │ true  │
    │ true  │
    └───────┘
    

    pl.any_horizontal(...) applies the same "any" reduction row-wise, across whatever expressions are passed in place of ... (here, the comparison of every column with "?").

    Lastly, if we put them together...

    pl_df.select(pl.any_horizontal(pl.all()=="?").any())
    
    shape: (1, 1)
    ┌──────┐
    │ any  │
    │ ---  │
    │ bool │
    ╞══════╡
    │ true │
    └──────┘
    

    Then we get a single value indicating that somewhere in the DataFrame there is an item equal to "?".