
pandas conditional groupby on a group


I am trying to do a conditional groupby and would like to know the best approach. I want to compute the minimum value over a period whose length depends on the country I group by. The following code works, but there might be better ways to do this.

Any possible improvements?

df
datetime                country  scenario  value
2023-01-01 00:00:00         FR      1         1  
2023-01-01 01:00:00         FR      1         2  
2023-01-01 02:00:00         FR      1         3  
2023-01-01 03:00:00         FR      1         4  
2023-01-01 01:00:00         DE      1         1  
2023-01-01 02:00:00         DE      1         2 
2023-01-01 03:00:00         DE      1         3  
2023-01-01 04:00:00         DE      1         4  

countries = {'DE':2,'FR':4}  # period length in hours per country
scenarios = df['scenario'].unique()  # assumption: not defined in the original snippet

df3l=[]
for sc in scenarios: # iterate over scenarios
    df2l = []
    for c in countries: # iterate over countries
        # select country and scenario; copy to avoid assigning to a slice of df
        df2 = df[(df['scenario']==sc) & (df['country']==c)].copy()
        # compute min over period for this country only
        df2['min over period'] = df2.groupby(pd.Grouper(key='datetime', freq=str(countries[c])+'H'))['value'].transform('min')
        df2l.append(df2)
    df2 = pd.concat(df2l,axis=0)
    df3l.append(df2)
df = pd.concat(df3l,axis=0)

Expected outcome:

 df
    datetime                country  scenario  min over period
    2023-01-01 00:00:00         FR      1         1  
    2023-01-01 01:00:00         FR      1         1  
    2023-01-01 02:00:00         FR      1         1  
    2023-01-01 03:00:00         FR      1         1  
    2023-01-01 00:00:00         DE      1         1  
    2023-01-01 01:00:00         DE      1         1 
    2023-01-01 02:00:00         DE      1         3  
    2023-01-01 03:00:00         DE      1         3  

Solution

  • This solution rests on the idea that if you number the rows within each country, divide each row number by the period length from your dictionary, and round the result up to the next integer, you get a numerical grouping column that increments according to your requirements. Note that it counts rows rather than elapsed time, so it assumes each country's rows are sorted by datetime and evenly spaced.

    There are many (possibly more intuitive) ways you could create the numbering, for example with np.tile or np.repeat.

    import numpy as np
    import pandas as pd
    countries = {'DE':2,'FR':4}
    df = pd.DataFrame({'datetime': ['00:00:00', '01:00:00', '02:00:00', '03:00:00', '01:00:00', '02:00:00', '03:00:00', '04:00:00'], 'country': ['FR', 'FR', 'FR', 'FR', 'DE', 'DE', 'DE', 'DE'], 'scenario': [1, 1, 1, 1, 1, 1, 1, 1], 'value': [1, 2, 3, 4, 1, 2, 3, 4]})
    
    
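    df['min_over_period'] = (
      df.groupby(['country',
                  # 1-based row number within each country ...
                  (df.groupby('country').cumcount()+1)
                     # ... divided by that country's period length ...
                     .div(df.country
                            .map(countries))
                     # ... and rounded up, giving one period index per row
                     .apply(np.ceil)])['value'].transform('min')
    )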
      
    
    print(df)
    

    Output

      datetime country  scenario  value  min_over_period
    0  00:00:00      FR         1      1                1
    1  01:00:00      FR         1      2                1
    2  02:00:00      FR         1      3                1
    3  03:00:00      FR         1      4                1
    4  01:00:00      DE         1      1                1
    5  02:00:00      DE         1      2                1
    6  03:00:00      DE         1      3                3
    7  04:00:00      DE         1      4                3
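
The row-numbering idea above can also be expressed in terms of elapsed time, which may be more robust if a country has missing hours. The following is only a sketch under some assumptions that are not in the original post: the datetime column is parsed as real timestamps, each period starts at the first timestamp of its (scenario, country) group, and the dictionary values are period lengths in hours. The names `keys`, `start`, `elapsed_hours` and `period` are illustrative.

```python
import pandas as pd

countries = {"DE": 2, "FR": 4}  # assumed: period length in hours per country

df = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2023-01-01 00:00", "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00",
        "2023-01-01 01:00", "2023-01-01 02:00", "2023-01-01 03:00", "2023-01-01 04:00",
    ]),
    "country": ["FR"] * 4 + ["DE"] * 4,
    "scenario": [1] * 8,
    "value": [1, 2, 3, 4, 1, 2, 3, 4],
})

keys = ["scenario", "country"]

# First timestamp of each (scenario, country) group, aligned to every row
start = df.groupby(keys)["datetime"].transform("min")

# Whole hours elapsed since the start of the group
elapsed_hours = (df["datetime"] - start) // pd.Timedelta(hours=1)

# Period index: elapsed hours floor-divided by the country's period length
period = elapsed_hours // df["country"].map(countries)

# Minimum of 'value' within each (scenario, country, period) bucket
df["min_over_period"] = df.groupby(keys + [period])["value"].transform("min")

print(df)
```

Unlike the row-count version, this also groups by scenario, as the loop in the question does, and a missing hour simply leaves a bucket with fewer rows instead of shifting every later row into the wrong period.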