Given the dataframe
import pandas as pd
df = pd.DataFrame(
{
"group0": [1, 1, 2, 2, 3, 3],
"group1": ["1", "1", "1", "2", "2", "2"],
"relevant": [True, False, False, True, True, True],
"value": [0, 1, 2, 3, 4, 5],
}
)
I wish to produce a target
target = pd.DataFrame(
{
"group0": [1, 2, 2, 3],
"group1": ["1","1", "2", "2",],
"value": [0, 2, 3, 5],
}
)
where "value"
has been chosen by
"relevant"
indices in "value"
column"value"
if no positive "relevant"
indices existThis would be produced by
def fun(x):
tmp = x["value"][x["relevant"]]
if len(tmp):
return tmp.max()
return x["value"].max()
where x is each group of the grouped dataframe.
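For reference, a minimal sketch of applying it per group to reproduce target (an illustration only; the rename/reset_index just flattens the result back into columns):
out = (
    df.groupby(["group0", "group1"])
      .apply(fun)              # fun returns one scalar per group
      .rename("value")
      .reset_index()
)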
Is it possible to achieve the desired groupby reduction efficiently?
EDIT:
with the payload
import numpy as np
from time import perf_counter
df = pd.DataFrame(
{
"group0": np.random.randint(0, 30,size=10_000_000),
"group1": np.random.randint(0, 30,size=10_000_000),
"relevant": np.random.randint(0, 1, size=10_000_000).astype(bool),
"value": np.random.random_sample(size=10_000_000) * 1000,
}
)
start = perf_counter()
out = (df
.sort_values(by=['relevant', 'value'])
.groupby(['group0', 'group1'], as_index=False)
['value'].last()
)
end = perf_counter()
print("Sort values", end - start)
def fun(x):
tmp = x["value"][x["relevant"]]
if len(tmp):
return tmp.max()
return x["value"].max()
start = perf_counter()
out = df.groupby(["group0", "group1"]).apply(fun)
end = perf_counter()
print("Apply", end - start)
#Sort values 14.823943354000221
#Apply 1.5050544870009617
The .apply solution took 1.5 s, while the sort_values solution took 14.82 s. However, shrinking the test groups (i.e. increasing the number of distinct groups) with
...
"group0": np.random.randint(0, 500_000,size=10_000_000),
"group1": np.random.randint(0, 100_000,size=10_000_000),
...
led to vastly superior performance from the sort_values solution (15.29 s versus 1423.84 s). The sort_values solution by @mozway is therefore preferred, unless the user specifically knows that the data contains only a small number of (large) groups.
Sort the values so that, within each group, the True "relevant" rows and then the highest "value" come last, and use groupby.last:
out = (df
.sort_values(by=['relevant', 'value'])
.groupby(['group0', 'group1'], as_index=False)
['value'].last()
)
Output:
group0 group1 value
0 1 1 0
1 2 1 2
2 2 2 3
3 3 2 5
Intermediate before aggregation (* = selected rows):
group0 group1 relevant value
1 1 1 False 1
2 2 1 False 2 *
0 1 1 True 0 *
3 2 2 True 3 *
4 3 2 True 4
5 3 2 True 5 *
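For completeness, a minimal sketch (an illustration, not part of the answer) that reproduces this intermediate view with the small example frame from the question and checks the aggregated result against the target:
intermediate = df.sort_values(by=["relevant", "value"])  # False rows first, then True; ties broken by "value"
print(intermediate)
# groupby(...).last() keeps the last (starred) row of each (group0, group1) pair
pd.testing.assert_frame_equal(out, target)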