Tags: group-by, count, python-polars

Counting all "-1" and -1 values in a dataframe for a given list of columns


I have a Polars dataframe, and for some of the columns I want to count, per group, the number of "-1" values (for string columns) and -1 values (for numeric columns). I'd like this to be a fast query, so I'm interested in learning idiomatic Polars code for it.

import polars as pl
import polars.selectors as cs
import numpy as np

df = pl.DataFrame( {"int_1": [1, 2, 3, -1, 5], 
                    "char_1": ["foo", "ham", "spam", "egg", "-1"], 
                    "not_of_interest": np.random.rand(5),
                    "groups": ["A", "A", "B", "C", "B"],
                    "int_2": [12, 12, 13, 14, 15]
                    } 
)

Desired dataframe looks like this:

shape: (3, 4)
┌────────┬───────┬───────┬────────┐
│ groups ┆ int_1 ┆ int_2 ┆ char_1 │
│ ---    ┆ ---   ┆ ---   ┆ ---    │
│ str    ┆ i64   ┆ i64   ┆ i64    │
╞════════╪═══════╪═══════╪════════╡
│ B      ┆ 0     ┆ 0     ┆ 1      │
│ C      ┆ 1     ┆ 0     ┆ 0      │
│ A      ┆ 0     ┆ 0     ┆ 0      │
└────────┴───────┴───────┴────────┘

My attempt thus far:

df.select(cs.numeric() | cs.by_name("groups")) \
.group_by("groups").agg((pl.col("*") == pl.lit(-1)).sum())

df.select(cs.string() | cs.by_name("groups")) \
.group_by("groups").agg((pl.col("*") == pl.lit("-1")).sum())

Then I might join these two tables, but I want to do this in one query.
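For reference, the joined two-query version would look something like this (exactly what I'd like to avoid; the variable names are just illustrative):

numeric_counts = df.select(cs.numeric() | cs.by_name("groups")) \
    .group_by("groups").agg((pl.col("*") == pl.lit(-1)).sum())
string_counts = df.select(cs.string() | cs.by_name("groups")) \
    .group_by("groups").agg((pl.col("*") == pl.lit("-1")).sum())

# combine the two partial results on the grouping column
numeric_counts.join(string_counts, on="groups")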

I would also like to define two lists: one with the variables to group on, and another with the variables in which to count -1s, so that I don't have to hardcode "groups". I'm not sure whether what I'm doing here is good code, though:

cols_to_check = ['int_1', 'int_2', 'char_1']
group_cols = ['groups']

df.select((cs.string() & cs.by_name(*cols_to_check)) | cs.by_name(*group_cols)) \
.group_by(group_cols).agg((pl.col("*") == pl.lit("-1")).sum())

df.select((cs.numeric() & cs.by_name(*cols_to_check)) | cs.by_name(*group_cols)) \
.group_by(group_cols).agg((pl.col("*") == pl.lit(-1)).sum())

EDIT: I think I have managed it.

# attempt:
df.group_by(group_cols).agg(((cs.numeric() & cs.by_name(*cols_to_check)) == -1).sum(),
                            ((cs.string() & cs.by_name(*cols_to_check)) == "-1").sum())

I honestly didn't know this would work. I guess I lack understanding of when column selectors can be used...
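From what I understand, a selector can be used anywhere an expression is accepted, and applying an operation such as == to it turns it into an ordinary expression. A small illustration (assuming a reasonably recent Polars version):

# a selector on its own in .select(), and as an expression once combined with ==
df.select(cs.numeric())
df.with_columns((cs.numeric() == -1).name.suffix("_is_neg1"))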


Solution

  • As you are testing for what is effectively the "same" value in every column, another approach could be to .cast() all the columns to a common type first.

    Casting to int:

    (df.group_by(group_cols)
       .agg((pl.col(cols_to_check).cast(int, strict=False) == -1).sum())
    )
    
    shape: (3, 4)
    ┌────────┬───────┬───────┬────────┐
    │ groups ┆ int_1 ┆ int_2 ┆ char_1 │
    │ ---    ┆ ---   ┆ ---   ┆ ---    │
    │ str    ┆ u32   ┆ u32   ┆ u32    │
    ╞════════╪═══════╪═══════╪════════╡
    │ A      ┆ 0     ┆ 0     ┆ 0      │
    │ B      ┆ 0     ┆ 0     ┆ 1      │
    │ C      ┆ 1     ┆ 0     ┆ 0      │
    └────────┴───────┴───────┴────────┘
    
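    Note that strict=False is what lets string columns such as char_1 pass through the integer cast: values that cannot be parsed (e.g. "foo") become null instead of raising an error, and those nulls compare as null and are then ignored by .sum(). A minimal illustration:

    pl.Series(["foo", "-1", "spam"]).cast(pl.Int64, strict=False)
    # => [null, -1, null]; with the default strict=True this cast would raise an error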

    Casting to string:

    (df.group_by(group_cols)
       .agg((pl.col(cols_to_check).cast(pl.String) == "-1").sum())
    )