python, pandas, grouping, lag

How to get the lagged values of a variable based on groups with pandas?


I would like to find the most efficient way to group data into runs of consecutive equal values and create a new variable that holds, for every row of a run, the value of the previous run (except for the first run, which should keep its own values).

That may not be very clear, so here is a toy example:

import pandas as pd

var = [1,1,1,2,2,3,1,1,2,4,4,4]
toy_df = pd.DataFrame(var, columns = ['var'])

My desired output is the following:

desired_output = pd.DataFrame(
                {'var': var,
                 'lagged_var':[1,1,1,1,1,2,3,3,1,2,2,2]}
                )

    var  lagged_var
0     1           1
1     1           1
2     1           1
3     2           1
4     2           1
5     3           2
6     1           3
7     1           3
8     2           1
9     4           2
10    4           2
11    4           2

So far I have come up with the following function which does the trick:

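def make_lag(var):
    # Label each run of consecutive equal values: 1, 2, 3, ...
    groups = (var.shift() != var).cumsum()
    var_shifted = pd.Series(0, index=var.index)
    for n_gp in groups.unique():
        if n_gp == 1:
            # First run: keep its own values.
            var_shifted[groups == n_gp] = var[groups == n_gp]
        else:
            # Later runs: take the value of the preceding run.
            var_shifted[groups == n_gp] = var[groups == n_gp - 1].iloc[0]

    return var_shifted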

And

toy_df['lagged_var'] = make_lag(toy_df['var'])

gives the desired output. However, I suspect it is highly inefficient, since it loops over every group in Python. Does anyone know a vectorized method that produces the same output? (I have to repeat this task for hundreds of very long time series, so it would save a lot of time.)

Thank you very much!


Solution

  • Use Series.shift to get the previous value, mask it wherever it equals the original value (so only the first row of each new run keeps a value), then forward fill to repeat that value over the run and back fill to fill the first run:

    s = toy_df['var'].shift()
    toy_df['new'] = s.mask(toy_df['var'].eq(s)).ffill().bfill()
    print(toy_df)
        var  new
    0     1  1.0
    1     1  1.0
    2     1  1.0
    3     2  1.0
    4     2  1.0
    5     3  2.0
    6     1  3.0
    7     1  3.0
    8     2  1.0
    9     4  2.0
    10    4  2.0
    11    4  2.0
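
To see why this works, it can help to look at the intermediate results. The sketch below rebuilds the same toy series and puts each step in its own column; the column names (prev, masked, ffill, result) are only illustrative and are not part of the answer above.

```python
import pandas as pd

var = pd.Series([1, 1, 1, 2, 2, 3, 1, 1, 2, 4, 4, 4], name="var")

# Previous row's value; the first row has no predecessor, so it is NaN.
prev = var.shift()

# Blank out rows whose value equals the previous row's value.
# What remains is non-missing only on the first row of each new run.
masked = prev.mask(var.eq(prev))

steps = pd.DataFrame({
    "var": var,
    "prev": prev,
    "masked": masked,
    "ffill": masked.ffill(),            # spread the value over the whole run
    "result": masked.ffill().bfill(),   # fill the first run, which is still NaN
})
print(steps)
```

After masking, each new run carries the last value of the previous run on its first row only. Forward filling repeats it down the run. Back filling touches only the first run, which has no predecessor, and gives it its own value, because the lag of the second run is exactly the first run's value.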
    

    If you want to convert the values to integers:

    s = toy_df['var'].shift()
    toy_df['new'] = s.mask(toy_df['var'].eq(s)).ffill().bfill().astype(int)
    print(toy_df)
        var  new
    0     1    1
    1     1    1
    2     1    1
    3     2    1
    4     2    1
    5     3    2
    6     1    3
    7     1    3
    8     2    1
    9     4    2
    10    4    2
    11    4    2
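
Since the question mentions hundreds of long series, the same expression could be wrapped in a small helper and applied column by column. This is only a sketch: lag_by_run, the wide frame and its columns a and b are made-up names, and the fillna fallback for a series that consists of a single run is an addition that the answer above does not include.

```python
import pandas as pd


def lag_by_run(series: pd.Series) -> pd.Series:
    """Return, for each row, the value of the previous run of equal values."""
    prev = series.shift()
    lagged = prev.mask(series.eq(prev)).ffill().bfill()
    # Assumption: a series with one single run has no previous run, so
    # everything is still NaN here; fall back to the original values.
    lagged = lagged.fillna(series)
    # Restore the input dtype (shift() turned integers into floats).
    return lagged.astype(series.dtype)


# Hypothetical wide frame: one time series per column.
wide = pd.DataFrame({
    "a": [1, 1, 1, 2, 2, 3, 1, 1, 2, 4, 4, 4],
    "b": [5, 5, 6, 6, 6, 6, 7, 5, 5, 5, 8, 8],
})

# DataFrame.apply calls the helper once per column.
lagged = wide.apply(lag_by_run).add_prefix("lagged_")
print(wide.join(lagged))
```

Without the fallback, a column made of one single run would stay all NaN after the fills, and astype(int) would raise on it. Note also that the helper assumes the input itself has no missing values; NaN in the data would be treated as the start of a new run and then filled over.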