
Reduce by multiple columns in pandas groupby


Given the dataframe

import pandas as pd

df = pd.DataFrame(
    {
        "group0": [1, 1, 2, 2, 3, 3],
        "group1": ["1", "1", "1", "2", "2", "2"],
        "relevant": [True, False, False, True, True, True],
        "value": [0, 1, 2, 3, 4, 5],
    }
)

I wish to produce the target

target = pd.DataFrame(
    {
        "group0": [1, 2, 2, 3],
        "group1": ["1","1", "2", "2",],
        "value": [0, 2, 3, 5],
    }
)

where "value" has been chosen by

  1. Maximum of all positive "relevant" indices in "value" column
  2. Otherwise maximum of "value" if no positive "relevant" indices exist

This would be produced by

def fun(x):
    # values of the rows flagged as relevant within this group
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    # no relevant rows: fall back to the group-wide maximum
    return x["value"].max()

where x is the per-group dataframe passed by groupby.apply.
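For reference, a minimal sketch of wiring fun up (the .reset_index(name="value") step is my addition, to reshape the resulting Series into the target frame):

out = (
    df.groupby(["group0", "group1"])
      .apply(fun)                      # Series indexed by (group0, group1)
      .reset_index(name="value")       # back to three flat columns
)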

Is it possible to achieve the desired groupby reduction efficiently?

EDIT:

with payload

import numpy as np
import pandas as pd
from time import perf_counter

df = pd.DataFrame(
    {
        "group0": np.random.randint(0, 30, size=10_000_000),
        "group1": np.random.randint(0, 30, size=10_000_000),
        "relevant": np.random.randint(0, 1, size=10_000_000).astype(bool),  # note: all False, since randint(0, 1) yields only 0
        "value": np.random.random_sample(size=10_000_000) * 1000,
    }
)

start = perf_counter()
out = (df
   .sort_values(by=['relevant', 'value'])
   .groupby(['group0', 'group1'], as_index=False)
   ['value'].last()
 )
end = perf_counter()
print("Sort values", end - start)

def fun(x):
    tmp = x["value"][x["relevant"]]
    if len(tmp):
        return tmp.max()
    return x["value"].max()

start = perf_counter()
out = df.groupby(["group0", "group1"]).apply(fun)
end = perf_counter()
print("Apply", end - start)
#Sort values 14.823943354000221
#Apply 1.5050544870009617

The .apply solution ran in 1.5 s, while the sort_values solution took 14.82 s. However, shrinking the test groups (by increasing the number of distinct keys) with

...
        "group0": np.random.randint(0, 500_000,size=10_000_000),
        "group1": np.random.randint(0, 100_000,size=10_000_000),
...

led to vastly superior performance from the sort_values solution (15.29 s versus 1423.84 s). The sort_values solution by @mozway is therefore preferred, unless the user knows in advance that the data contains only a small number of (large) groups.
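For completeness, here is a fully vectorized sketch (my addition, not from the original post) that expresses the same rule without sorting or a Python-level apply: mask out the non-relevant values, take the per-group maximum, and fall back to the plain group maximum where a group has no relevant rows:

keys = ["group0", "group1"]

# per-group max over relevant rows only (NaN for groups with none)
rel_max = df["value"].where(df["relevant"]).groupby([df[k] for k in keys]).max()

# per-group max over all rows, used as the fallback
all_max = df.groupby(keys)["value"].max()

# fillna aligns on the (group0, group1) index; note that the masking
# promotes an integer "value" column to float
out = rel_max.fillna(all_max).reset_index()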


Solution

  • Sort the values so that, within each group, rows with relevant=True and the highest value end up last, then use groupby.last:

    out = (df
       .sort_values(by=['relevant', 'value'])
       .groupby(['group0', 'group1'], as_index=False)
       ['value'].last()
     )
    

    Output:

       group0 group1  value
    0       1      1      0
    1       2      1      2
    2       2      2      3
    3       3      2      5
    

    Intermediate before aggregation (rows marked * are the ones kept by last):

       group0 group1  relevant  value
    1       1      1     False      1
    2       2      1     False      2  *
    0       1      1      True      0  *
    3       2      2      True      3  *
    4       3      2      True      4
    5       3      2      True      5  *
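
    The starred rows can also be recovered programmatically (my addition, not part of the original answer), since last keeps exactly the final row of each group in the sorted order:

    inter = df.sort_values(by=['relevant', 'value'])
    # tail(1) returns the last row of each group, i.e. the starred rows
    selected = inter.groupby(['group0', 'group1']).tail(1)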