Search code examples
pythonpandasdataframefunctiongroup-by

Get minimum through record iterations in pandas dataframe


I have created the following pandas dataframe:

import pandas as pd
import numpy as np

ds = { 'trend' : [1,1,1,1,2,2,3,3,3,3,3,3,4,4,4,4,4], 'price' : [23,43,56,21,43,55,54,32,9,12,11,12,23,3,2,1,1]}

df = pd.DataFrame(data=ds)

The dataframe looks as follows:

display(df)

    trend   price
0      1     23
1      1     43
2      1     56
3      1     21
4      2     43
5      2     55
6      3     54
7      3     32
8      3     9
9      3     12
10     3     11
11     3     12
12     4     23
13     4     3
14     4     2
15     4     1
16     4     1

I have saved the dataframe to a .csv file called df.csv:

df.to_csv("df.csv", index = False)

I need to create a new field called minimum which:

  1. iterates through each and every record of the dataframe
  2. takes the mimimum between the price observed at each iteration and the last price observed in the previous trend.

For example:

  • I iterate at record 0 and the minimum price is 23 (there is only that one).
  • I iterate at record 1 and take the minimum between 43 and 23: the result is 23.

Fast forward to record 4.

  • I need to calculate the minimum between the price observed at record 4 (price: 43) and the last price observed for the previous trend (price: 21). The result is 21.

Fast forward to record 14.

  • I need to calculate the minimum between the price observed at record 14 (price: 2) and the last price observed for the previous trend (price: 12). The result is 2.

And so on.

I have then written this code:

minimum = []

for i in range(len(df)):
    ds = pd.read_csv("df.csv", nrows=i+1)
    d = ds.groupby('trend', as_index=False).agg(
                                                    {'price':'last'})

    d['minimum'] = d['price'].min()
    minimum.append(d['minimum'].iloc[-1])


ds['minimum'] = minimum

The resulting dataframe looks as follows:

display(ds)

        trend   price   minimum
0          1     23         23
1          1     43         43
2          1     56         56
3          1     21         21
4          2     43         21
5          2     55         21
6          3     54         21
7          3     32         21
8          3      9          9
9          3     12         12
10         3     11         11
11         3     12         12
12         4     23         12
13         4      3          3
14         4      2          2
15         4      1          1
16         4      1          1

The resulting dataframe is correct.

The problem is that I have to apply this process to a dataframe which contains about 1 million records and it will take about 48 years to complete.

Does anybody know a quicker way to obtain the same results above?


Solution

  • Aggregate, shift, then map and get the minimum with :

    prev = (df.groupby('trend')['price'].min()
              .shift(fill_value=np.inf)
            )
    
    df['minimum'] = np.minimum(df['price'], df['trend'].map(prev))
    

    Output:

        trend  price  minimum
    0       1     23       23
    1       1     43       43
    2       1     56       56
    3       1     21       21
    4       2     43       21
    5       2     55       21
    6       3     54       43
    7       3     32       32
    8       3      9        9
    9       3     12       12
    10      3     11       11
    11      3     12       12
    12      4     23        9
    13      4      3        3
    14      4      2        2
    15      4      1        1
    16      4      1        1