I am trying to do a conditional groupby and would like to know the best approach. I want to compute the minimum value over a period, where the length of the period depends on the country I am grouping by. The following code works, but there may be better ways to do this.
Any possible improvements?
df
datetime country scenario value
2023-01-01 00:00:00 FR 1 1
2023-01-01 01:00:00 FR 1 2
2023-01-01 02:00:00 FR 1 3
2023-01-01 03:00:00 FR 1 4
2023-01-01 01:00:00 DE 1 1
2023-01-01 02:00:00 DE 1 2
2023-01-01 03:00:00 DE 1 3
2023-01-01 04:00:00 DE 1 4
countries = {'DE': 2, 'FR': 4}  # period length in hours per country
scenarios = df['scenario'].unique()  # assumed: not defined in the original snippet
df3l = []
for sc in scenarios:  # iterate over scenarios
    df2l = []
    for c in countries:  # iterate over countries
        # select country and scenario; copy to avoid SettingWithCopyWarning
        df2 = df[(df['scenario'] == sc) & (df['country'] == c)].copy()
        # compute min over period per country
        grouper = pd.Grouper(key='datetime', freq=f'{countries[c]}h')
        df2['min over period'] = df2.groupby(['scenario', 'country', grouper])['value'].transform('min')
        df2l.append(df2)
    df3l.append(pd.concat(df2l, axis=0))
df = pd.concat(df3l, axis=0)
Expected outcome:
df
datetime country scenario min over period
2023-01-01 00:00:00 FR 1 1
2023-01-01 01:00:00 FR 1 1
2023-01-01 02:00:00 FR 1 1
2023-01-01 03:00:00 FR 1 1
2023-01-01 00:00:00 DE 1 1
2023-01-01 01:00:00 DE 1 1
2023-01-01 02:00:00 DE 1 3
2023-01-01 03:00:00 DE 1 3
This solution relies on the idea that if you number the rows within each country and divide each row number by that country's period length from your dictionary, the quotient, rounded up to the next integer, gives a numerical grouping column that increments once per period. Note that it counts rows rather than reading the timestamps, so it assumes one row per hour, sorted by time within each country.
There are many other (possibly more intuitive) ways to create the numbering, for example with np.tile or np.repeat.
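To make the numbering idea concrete, here is a minimal sketch that builds the grouping key as separate steps on a cut-down version of the sample data. The names row_number, window, block and block_alt are only illustrative, and block_alt shows one of the alternative numberings (integer division on a zero-based row number), which should give the same groups.

```python
import numpy as np
import pandas as pd

countries = {"DE": 2, "FR": 4}  # period length (in rows) per country
df = pd.DataFrame({
    "country": ["FR"] * 4 + ["DE"] * 4,
    "value": [1, 2, 3, 4, 1, 2, 3, 4],
})

# 1-based row number within each country: 1, 2, 3, 4 for FR and for DE
row_number = df.groupby("country").cumcount() + 1
# period length looked up per row: 4 for FR rows, 2 for DE rows
window = df["country"].map(countries)
# quotient rounded up: FR -> 1, 1, 1, 1 and DE -> 1, 1, 2, 2
block = np.ceil(row_number / window).astype(int)

# alternative numbering: zero-based row number with integer division
block_alt = df.groupby("country").cumcount() // window

# either key can be passed to groupby together with the country
df["min_over_period"] = df.groupby(["country", block])["value"].transform("min")
print(df.assign(block=block, block_alt=block_alt))
```

The two keys differ only by an offset of one, so grouping on either produces the same minimum per period.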
import numpy as np
import pandas as pd
countries = {'DE':2,'FR':4}
df = pd.DataFrame({'datetime': ['00:00:00', '01:00:00', '02:00:00', '03:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00'], 'country': ['FR', 'FR', 'FR', 'FR', 'DE', 'DE', 'DE', 'DE'], 'scenario': [1, 1, 1, 1, 1, 1, 1, 1], 'value': [1, 2, 3, 4, 1, 2, 3, 4]})
df['min_over_period'] = (
    df.groupby(['country',
                (df.groupby('country').cumcount()+1)
                .div(df.country.map(countries))
                .apply(np.ceil)])['value'].transform('min')
)
print(df)
Output
datetime country scenario value min_over_period
0 00:00:00 FR 1 1 1
1 01:00:00 FR 1 2 1
2 02:00:00 FR 1 3 1
3 03:00:00 FR 1 4 1
4 01:00:00 DE 1 1 1
5 02:00:00 DE 1 2 1
6 03:00:00 DE 1 3 3
7 04:00:00 DE 1 4 3
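If the periods should follow the actual timestamps instead of the row count, one possible variant is to floor each timestamp to the start of its country-specific window and group on that. This is only a sketch under some assumptions: the datetime column is parsed to real datetimes, windows are aligned to clock boundaries (00:00, 02:00, ...), and period_start is a helper column name made up for the example. Be aware that it gives different results from the row-count approach when a country's rows do not start on a window boundary, as with DE here, which starts at 01:00.

```python
import pandas as pd

countries = {"DE": 2, "FR": 4}  # window length in hours per country
df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00",
        "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00", "2023-01-01 04:00",
    ]),
    "country": ["FR"] * 4 + ["DE"] * 4,
    "scenario": [1] * 8,
    "value": [1, 2, 3, 4, 1, 2, 3, 4],
})

# floor the timestamps of each country to that country's window length
parts = [
    df.loc[df["country"] == c, "datetime"].dt.floor(pd.Timedelta(hours=h))
    for c, h in countries.items()
]
# concat keeps the original index labels, so the assignment aligns row by row
df["period_start"] = pd.concat(parts)

# one groupby over scenario, country and window start replaces the nested loops
df["min_over_period"] = (
    df.groupby(["scenario", "country", "period_start"])["value"].transform("min")
)
print(df)
```

With the sample data, FR falls into a single 00:00 window, while the DE rows land in the 00:00, 02:00 and 04:00 windows, so the DE minima come out as 1, 2, 2, 4 rather than 1, 1, 3, 3.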