I would like to find the most efficient way to group data by runs of consecutive equal values and create a new variable that holds, for each group, the value of the previous group (except for the first group, which should keep its own value).
I might not be really clear at this point, so here is a toy example:
import pandas as pd
var = [1,1,1,2,2,3,1,1,2,4,4,4]
toy_df = pd.DataFrame(var, columns = ['var'])
My desired output is the following:
desired_output = pd.DataFrame(
    {'var': var,
     'lagged_var': [1,1,1,1,1,2,3,3,1,2,2,2]}
)
    var  lagged_var
0     1           1
1     1           1
2     1           1
3     2           1
4     2           1
5     3           2
6     1           3
7     1           3
8     2           1
9     4           2
10    4           2
11    4           2
So far I have come up with the following function which does the trick:
def make_lag(var):
    groups = (var.shift() != var).cumsum()
    var_shifted = pd.Series([0]*len(var))
    for n_gp in groups.unique():
        if n_gp == 1:
            var_shifted[groups == n_gp] = var[groups == n_gp]
        else:
            var_shifted[groups == n_gp] = var[groups == n_gp - 1].iloc[0]
    return var_shifted
And
toy_df['lagged_values'] = toy_df.apply(lambda x: make_lag(x))
gives the desired output. However, I suspect it is highly inefficient, since it loops over every group in Python and builds a boolean mask for each one. Does anyone know a vectorized method to produce the same output? (I have to repeat this task for hundreds of very long time series, so it would be a real time-saver!)
Thank you very much!
Use Series.shift to get the previous value, replace it with a missing value wherever it equals the original value (Series.mask), and then repeat the remaining values by forward filling and back filling the missing values:
s = toy_df['var'].shift()
toy_df['new'] = s.mask(toy_df['var'].eq(s)).ffill().bfill()
print (toy_df)
    var  new
0     1  1.0
1     1  1.0
2     1  1.0
3     2  1.0
4     2  1.0
5     3  2.0
6     1  3.0
7     1  3.0
8     2  1.0
9     4  2.0
10    4  2.0
11    4  2.0
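To see why this works, it may help to look at the intermediate steps of the chain. The following is only a sketch on the same toy data; the names `prev`, `masked` and `steps` are mine and not part of the solution above:

```python
import pandas as pd

var = [1, 1, 1, 2, 2, 3, 1, 1, 2, 4, 4, 4]
toy_df = pd.DataFrame(var, columns=['var'])

# Previous row's value. The first row has no predecessor, so it becomes NaN,
# which is also why the column is converted to floats.
prev = toy_df['var'].shift()

# Keep the previous value only on the first row of each run of equal values;
# inside a run (current value == previous value) replace it with NaN.
masked = prev.mask(toy_df['var'].eq(prev))

# ffill spreads each kept value over the rest of its run; bfill then fills the
# first run from the first kept value, which is the first run's own value.
steps = pd.DataFrame({
    'var': toy_df['var'],
    'prev': prev,
    'masked': masked,
    'ffill': masked.ffill(),
    'result': masked.ffill().bfill(),
})
print(steps)
```

The `masked` column is non-missing exactly at the rows where a new run starts (except the very first row), so the two fills do all of the "repeat per group" work without an explicit groupby.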
The result is float because shift introduces a missing value in the first row. If you want to convert the values to integers:
s = toy_df['var'].shift()
toy_df['new'] = s.mask(toy_df['var'].eq(s)).ffill().bfill().astype(int)
print (toy_df)
    var  new
0     1    1
1     1    1
2     1    1
3     2    1
4     2    1
5     3    2
6     1    3
7     1    3
8     2    1
9     4    2
10    4    2
11    4    2
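Since the task has to be repeated for many series, the chain could be wrapped in a small function and applied per column. This is a minimal sketch, not a tested production helper: `lag_by_run` and the frame `wide` are hypothetical names, and it assumes numeric series without missing values. It also assumes that a series consisting of a single run should keep its own values, because in that case the chain alone returns only NaN and `astype(int)` would fail:

```python
import pandas as pd


def lag_by_run(s: pd.Series) -> pd.Series:
    """Return, for each row, the value of the previous run of equal values.

    Hypothetical helper; assumes a numeric Series without missing values.
    """
    prev = s.shift()
    out = prev.mask(s.eq(prev)).ffill().bfill()
    # Assumption: if the whole series is a single run there is no previous
    # run, so every row keeps its own value (same rule as the first group).
    out = out.fillna(s)
    return out.astype(s.dtype)


var = [1, 1, 1, 2, 2, 3, 1, 1, 2, 4, 4, 4]

# Made-up wide frame standing in for several time series, one per column.
wide = pd.DataFrame({'a': var, 'b': var[::-1]})

# DataFrame.apply calls the helper once per column.
lagged = wide.apply(lag_by_run)
print(lagged)
```

The only Python-level loop left is the one over columns; the work inside each column is done by vectorized pandas operations.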