Search code examples
pythonrpython-3.xlag

Lag Function usage in python to shift the data and create new variables


I have a dataset as below:

The 'FIRST' and 'SECOND' variable has eight combination each(which can change) an the 'SCORE' has a value for each combination

     FIRST  SECOND  SCORE
0       1       1  38147
1       2       1    567
2       3       1      0
3       4       1      0
4       5       1      0
5       6       1      0
6       7       1      0
7       8       1      0
8       1       2   3099
9       2       2  30460
10      3       2   2372
11      4       2      0
12      5       2      0
13      6       2      0
14      7       2      0
15      8       2    267

I wanted to create 8 new variables with suffix as 'S' (the number of new variables is same as number of unique values in 'FIRST' or 'SECOND' and the shift the value of 'SCORE' with one lag in each new variables.

DESIRED OUTPUT

      FIRST  SECOND  SCORE     S1       S2       S3       S4       S5      S6      S7   S8
0       1       1  38147      NaN      NaN      NaN      NaN      NaN      NaN      NaN NaN
1       2       1    567  38147.0      NaN      NaN      NaN      NaN      NaN      NaN NaN
2       3       1      0    567.0  38147.0      NaN      NaN      NaN      NaN      NaN NaN
3       4       1      0      0.0    567.0  38147.0      NaN      NaN      NaN      NaN NaN
4       5       1      0      0.0      0.0    567.0  38147.0      NaN      NaN      NaN NaN
5       6       1      0      0.0      0.0      0.0    567.0  38147.0      NaN      NaN NaN
6       7       1      0      0.0      0.0      0.0      0.0    567.0  38147.0      NaN NaN
7       8       1      0      0.0      0.0      0.0      0.0      0.0    567.0  38147.0 NaN
8       1       2   3099      NaN      NaN      NaN      NaN      NaN      NaN      NaN NaN
9       2       2  30460   3099.0      NaN      NaN      NaN      NaN      NaN      NaN NaN
10      3       2   2372  30460.0   3099.0      NaN      NaN      NaN      NaN      NaN NaN
11      4       2      0   2372.0  30460.0   3099.0      NaN      NaN      NaN      NaN NaN
12      5       2      0      0.0   2372.0  30460.0   3099.0      NaN      NaN      NaN NaN
13      6       2      0      0.0      0.0   2372.0  30460.0   3099.0      NaN      NaN NaN
14      7       2      0      0.0      0.0      0.0   2372.0  30460.0   3099.0      NaN NaN
15      8       2    267      0.0      0.0      0.0      0.0   2372.0  30460.0   3099.0 NaN

Is there a way to accomplish this in python?

Below is equivalent code in R language which accomplishes this using data.table package

R code

number_of_bids <- 8

test[, paste0('S',seq(1:number_of_bids)):= shift(SCORE,1:number_of_bids, NA_real_,"lag"), by = SECOND]

Below is the input python data for reference:


#Ceate dataframe
data = [
    [1,1,38147],
    [2,1,567],
    [3,1,0],
    [4,1,0],
    [5,1,0],
    [6,1,0],
    [7,1,0],
    [8,1,0],
    [1,2,3099],
    [2,2,30460],
    [3,2,2372],
    [4,2,0],
    [5,2,0],
    [6,2,0],
    [7,2,0],
    [8,2,267],
       
]

#dataframe
test1 = pd.DataFrame(data, columns = ['FIRST', 'SECOND', 'SCORE'])

Solution

  • Two methods:

    1. .shift and pd.concat:

      In [324]: pd.concat([dat,dat.groupby(['SECOND'])['SCORE'].shift(), dat.groupby(['SECOND'])['SCORE'].shift(2)], axis=1)
      Out[324]: 
          FIRST  SECOND  SCORE    SCORE    SCORE
      0       1       1  38147      NaN      NaN
      1       2       1    567  38147.0      NaN
      2       3       1      0    567.0  38147.0
      3       4       1      0      0.0    567.0
      4       5       1      0      0.0      0.0
      5       6       1      0      0.0      0.0
      6       7       1      0      0.0      0.0
      7       8       1      0      0.0      0.0
      8       1       2   3099      NaN      NaN
      9       2       2  30460   3099.0      NaN
      10      3       2   2372  30460.0   3099.0
      11      4       2      0   2372.0  30460.0
      12      5       2      0      0.0   2372.0
      13      6       2      0      0.0      0.0
      14      7       2      0      0.0      0.0
      15      8       2    267      0.0      0.0
      

      (and rename the subsequent columns).

    2. for loop and .shift:

      In [337]: for lag in range(1, 5):
           ...:   dat['S' + str(lag)] = dat.groupby(['SECOND'])['SCORE'].shift(lag)
           ...: 
      
      In [338]: dat
      Out[338]: 
          FIRST  SECOND  SCORE       S1       S2       S3       S4
      0       1       1  38147      NaN      NaN      NaN      NaN
      1       2       1    567  38147.0      NaN      NaN      NaN
      2       3       1      0    567.0  38147.0      NaN      NaN
      3       4       1      0      0.0    567.0  38147.0      NaN
      4       5       1      0      0.0      0.0    567.0  38147.0
      5       6       1      0      0.0      0.0      0.0    567.0
      6       7       1      0      0.0      0.0      0.0      0.0
      7       8       1      0      0.0      0.0      0.0      0.0
      8       1       2   3099      NaN      NaN      NaN      NaN
      9       2       2  30460   3099.0      NaN      NaN      NaN
      10      3       2   2372  30460.0   3099.0      NaN      NaN
      11      4       2      0   2372.0  30460.0   3099.0      NaN
      12      5       2      0      0.0   2372.0  30460.0   3099.0
      13      6       2      0      0.0      0.0   2372.0  30460.0
      14      7       2      0      0.0      0.0      0.0   2372.0
      15      8       2    267      0.0      0.0      0.0      0.0
      

    Data

    import pandas as pd
    dat = pd.DataFrame({
      'FIRST': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')),
      'SECOND': pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')), 
      'SCORE': pd.Series([38147, 567, 0, 0, 0, 0, 0, 0, 3099, 30460, 2372, 0, 0, 0, 0, 267],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64'))
    }, index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64'))