Search code examples
pythonnumpypandasmax

Python: Running maximum by another column?


I have a dataframe like this, which tracks the value of certain items (ids) over time:

mytime=np.tile( np.arange(0,10) , 2 )
myids=np.repeat( [123,456], [10,10] )
myvalues=np.random.random_integers(20,30,10*2)

df=pd.DataFrame()
df['myids']=myids
df['mytime']=mytime
df['myvalues']=myvalues



+-------+--------+----------+--+--+
| myids | mytime | myvalues |  |  |
+-------+--------+----------+--+--+
| 123   | 0      | 29       |  |  |
+-------+--------+----------+--+--+
| 123   | 1      | 23       |  |  |
+-------+--------+----------+--+--+
| 123   | 2      | 26       |  |  |
+-------+--------+----------+--+--+
| 123   | 3      | 24       |  |  |
+-------+--------+----------+--+--+
| 123   | 4      | 25       |  |  |
+-------+--------+----------+--+--+
| 123   | 5      | 29       |  |  |
+-------+--------+----------+--+--+
| 123   | 6      | 28       |  |  |
+-------+--------+----------+--+--+
| 123   | 7      | 21       |  |  |
+-------+--------+----------+--+--+
| 123   | 8      | 20       |  |  |
+-------+--------+----------+--+--+
| 123   | 9      | 26       |  |  |
+-------+--------+----------+--+--+
| 456   | 0      | 26       |  |  |
+-------+--------+----------+--+--+
| 456   | 1      | 24       |  |  |
+-------+--------+----------+--+--+
| 456   | 2      | 20       |  |  |
+-------+--------+----------+--+--+
| 456   | 3      | 26       |  |  |
+-------+--------+----------+--+--+
| 456   | 4      | 29       |  |  |
+-------+--------+----------+--+--+
| 456   | 5      | 29       |  |  |
+-------+--------+----------+--+--+
| 456   | 6      | 24       |  |  |
+-------+--------+----------+--+--+
| 456   | 7      | 21       |  |  |
+-------+--------+----------+--+--+
| 456   | 8      | 27       |  |  |
+-------+--------+----------+--+--+
| 456   | 9      | 29       |  |  |
+-------+--------+----------+--+--+

I'd need to calculate the running maximum for each id.

np.maximum.accumulate()

would calculate the running maximum regardless of id, whereas I need a similar calculation, which however resets every time the id changes. I can think of a simple script to do it in numba (I have very large arrays and non-vectorised non-numba code would be slow), but is there an easier way to do it?

With just two values I can run:

df['running max']= np.hstack((  np.maximum.accumulate(df[ df['myids']==123 ]['myvalues']) , np.maximum.accumulate(df[ df['myids']==456 ]['myvalues']) )  )

but this is not feasible with lots and lots of values.


Solution

  • Here you go. Assumption is mytime is sorted.

    mytime=np.tile( np.arange(0,10) , 2 )
    myids=np.repeat( [123,456], [10,10] )
    myvalues=np.random.random_integers(20,30,10*2)
    
    df=pd.DataFrame()
    df['myids']=myids
    df['mytime']=mytime
    df['myvalues']=myvalues
    
    groups = df.groupby('myids')
    df['run_max_group'] = groups['myvalues'].transform(np.maximum.accumulate)
    

    Output...

        myids  mytime  myvalues  run_max_group
    0     123       0        27             27
    1     123       1        21             27
    2     123       2        24             27
    3     123       3        25             27
    4     123       4        22             27
    5     123       5        20             27
    6     123       6        20             27
    7     123       7        30             30
    8     123       8        24             30
    9     123       9        22             30
    10    456       0        29             29
    11    456       1        23             29
    12    456       2        30             30
    13    456       3        28             30
    14    456       4        26             30
    15    456       5        25             30
    16    456       6        28             30
    17    456       7        27             30
    18    456       8        20             30
    19    456       9        24             30