I have a dataset as below:
The 'FIRST' and 'SECOND' variable has eight combination each(which can change) an the 'SCORE' has a value for each combination
FIRST SECOND SCORE
0 1 1 38147
1 2 1 567
2 3 1 0
3 4 1 0
4 5 1 0
5 6 1 0
6 7 1 0
7 8 1 0
8 1 2 3099
9 2 2 30460
10 3 2 2372
11 4 2 0
12 5 2 0
13 6 2 0
14 7 2 0
15 8 2 267
I wanted to create 8 new variables with suffix as 'S' (the number of new variables is same as number of unique values in 'FIRST' or 'SECOND' and the shift the value of 'SCORE' with one lag in each new variables.
DESIRED OUTPUT
FIRST SECOND SCORE S1 S2 S3 S4 S5 S6 S7 S8
0 1 1 38147 NaN NaN NaN NaN NaN NaN NaN NaN
1 2 1 567 38147.0 NaN NaN NaN NaN NaN NaN NaN
2 3 1 0 567.0 38147.0 NaN NaN NaN NaN NaN NaN
3 4 1 0 0.0 567.0 38147.0 NaN NaN NaN NaN NaN
4 5 1 0 0.0 0.0 567.0 38147.0 NaN NaN NaN NaN
5 6 1 0 0.0 0.0 0.0 567.0 38147.0 NaN NaN NaN
6 7 1 0 0.0 0.0 0.0 0.0 567.0 38147.0 NaN NaN
7 8 1 0 0.0 0.0 0.0 0.0 0.0 567.0 38147.0 NaN
8 1 2 3099 NaN NaN NaN NaN NaN NaN NaN NaN
9 2 2 30460 3099.0 NaN NaN NaN NaN NaN NaN NaN
10 3 2 2372 30460.0 3099.0 NaN NaN NaN NaN NaN NaN
11 4 2 0 2372.0 30460.0 3099.0 NaN NaN NaN NaN NaN
12 5 2 0 0.0 2372.0 30460.0 3099.0 NaN NaN NaN NaN
13 6 2 0 0.0 0.0 2372.0 30460.0 3099.0 NaN NaN NaN
14 7 2 0 0.0 0.0 0.0 2372.0 30460.0 3099.0 NaN NaN
15 8 2 267 0.0 0.0 0.0 0.0 2372.0 30460.0 3099.0 NaN
Is there a way to accomplish this in python?
Below is equivalent code in R language which accomplishes this using data.table package
R code
number_of_bids <- 8
test[, paste0('S',seq(1:number_of_bids)):= shift(SCORE,1:number_of_bids, NA_real_,"lag"), by = SECOND]
Below is the input python data for reference:
#Ceate dataframe
data = [
[1,1,38147],
[2,1,567],
[3,1,0],
[4,1,0],
[5,1,0],
[6,1,0],
[7,1,0],
[8,1,0],
[1,2,3099],
[2,2,30460],
[3,2,2372],
[4,2,0],
[5,2,0],
[6,2,0],
[7,2,0],
[8,2,267],
]
#dataframe
test1 = pd.DataFrame(data, columns = ['FIRST', 'SECOND', 'SCORE'])
Two methods:
.shift
and pd.concat
:
In [324]: pd.concat([dat,dat.groupby(['SECOND'])['SCORE'].shift(), dat.groupby(['SECOND'])['SCORE'].shift(2)], axis=1)
Out[324]:
FIRST SECOND SCORE SCORE SCORE
0 1 1 38147 NaN NaN
1 2 1 567 38147.0 NaN
2 3 1 0 567.0 38147.0
3 4 1 0 0.0 567.0
4 5 1 0 0.0 0.0
5 6 1 0 0.0 0.0
6 7 1 0 0.0 0.0
7 8 1 0 0.0 0.0
8 1 2 3099 NaN NaN
9 2 2 30460 3099.0 NaN
10 3 2 2372 30460.0 3099.0
11 4 2 0 2372.0 30460.0
12 5 2 0 0.0 2372.0
13 6 2 0 0.0 0.0
14 7 2 0 0.0 0.0
15 8 2 267 0.0 0.0
(and rename the subsequent columns).
for
loop and .shift
:
In [337]: for lag in range(1, 5):
...: dat['S' + str(lag)] = dat.groupby(['SECOND'])['SCORE'].shift(lag)
...:
In [338]: dat
Out[338]:
FIRST SECOND SCORE S1 S2 S3 S4
0 1 1 38147 NaN NaN NaN NaN
1 2 1 567 38147.0 NaN NaN NaN
2 3 1 0 567.0 38147.0 NaN NaN
3 4 1 0 0.0 567.0 38147.0 NaN
4 5 1 0 0.0 0.0 567.0 38147.0
5 6 1 0 0.0 0.0 0.0 567.0
6 7 1 0 0.0 0.0 0.0 0.0
7 8 1 0 0.0 0.0 0.0 0.0
8 1 2 3099 NaN NaN NaN NaN
9 2 2 30460 3099.0 NaN NaN NaN
10 3 2 2372 30460.0 3099.0 NaN NaN
11 4 2 0 2372.0 30460.0 3099.0 NaN
12 5 2 0 0.0 2372.0 30460.0 3099.0
13 6 2 0 0.0 0.0 2372.0 30460.0
14 7 2 0 0.0 0.0 0.0 2372.0
15 8 2 267 0.0 0.0 0.0 0.0
Data
import pandas as pd
dat = pd.DataFrame({
'FIRST': pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')),
'SECOND': pd.Series([1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64')),
'SCORE': pd.Series([38147, 567, 0, 0, 0, 0, 0, 0, 3099, 30460, 2372, 0, 0, 0, 0, 267],dtype='int64',index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64'))
}, index=pd.Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], dtype='int64'))