I have the following example dataframe:
import pandas as pd

d = {'col1': [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]}
df = pd.DataFrame(data=d)
df
   col1
0     4
1     2
2     8
3     4
4     3
5     7
6     6
7     9
8     3
9     5
I need to add col2 to this dataframe. The values of this new column are set by comparing col1 values from different rows, as described below. Each row of col2 is set as follows:
df.loc[0, "col2"] will say how many of df.loc[1, "col1"], df.loc[2, "col1"] and df.loc[3, "col1"] are bigger than df.loc[0, "col1"].
df.loc[1, "col2"] will say how many of df.loc[2, "col1"], df.loc[3, "col1"] and df.loc[4, "col1"] are bigger than df.loc[1, "col1"].
df.loc[2, "col2"] will say how many of df.loc[3, "col1"], df.loc[4, "col1"] and df.loc[5, "col1"] are bigger than df.loc[2, "col1"].
And so on...
If there are fewer than 3 rows left after index N, the col2 value for that row will be set to -1.
The end result will look like the following:
   col1  col2
0     4     1
1     2     3
2     8     0
3     4     2
4     3     3
5     7     1
6     6     1
7     9    -1
8     3    -1
9     5    -1
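To make the rule unambiguous, here is a slow reference sketch that uses a plain loop. It is not the solution I am looking for, only something a faster version could be checked against. The function name count_larger_next_reference and its parameter n are made up for illustration.

```python
import pandas as pd


def count_larger_next_reference(df, n=3):
    """Slow reference: for each row, count how many of the next n col1
    values are strictly larger; -1 when fewer than n rows follow."""
    values = df["col1"].tolist()
    result = []
    for i, current in enumerate(values):
        following = values[i + 1 : i + 1 + n]  # up to n rows after row i
        if len(following) < n:
            result.append(-1)  # not enough rows left
        else:
            result.append(sum(v > current for v in following))
    out = df.copy()  # leave the input dataframe untouched
    out["col2"] = result
    return out


df = pd.DataFrame({"col1": [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]})
reference = count_larger_next_reference(df, n=3)
```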
I need a function that will take a dataframe as input and will return the dataframe by adding the new column as described above.
In the example above, the next 3 rows are considered, but this number needs to be configurable and should be an input to the function that does the work.
Speed is important here, so I would like to avoid for loops.
How can this be done in the most efficient way in Python?
You need a reversed rolling to compare the values to the next ones:
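N = 3
df['col2'] = (df.loc[::-1, 'col1']   # walk the column bottom-up
                .rolling(N+1)        # current row plus the N rows after it
                .apply(lambda s: s.iloc[:-1].gt(s.iloc[-1]).sum())  # next N values > current
                .fillna(-1)          # rows with fewer than N rows after them
                .astype(int)         # fillna's downcast= is deprecated in recent pandas
             )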
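If the rolling approach is too slow, one possible tweak is to pass raw=True to apply, so each window arrives as a NumPy array instead of a Series. This is a sketch of that variant on the example data, not a benchmarked recommendation; the variable name col2 is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"col1": [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]})
N = 3

col2 = (
    df["col1"].iloc[::-1]      # reversed, so each window ends at the current row
    .rolling(N + 1)            # current row plus the N rows that follow it
    # raw=True hands the window over as an ndarray: a[-1] is the current
    # value, a[:-1] are the N values that come after it in the original order
    .apply(lambda a: (a[:-1] > a[-1]).sum(), raw=True)
    .fillna(-1)                # incomplete windows: fewer than N rows left
    .astype(int)
)
df["col2"] = col2              # assignment aligns on the index, restoring order
```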
Alternatively, using numpy.lib.stride_tricks.sliding_window_view:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view as swv
N = 3
a = df['col1'].to_numpy()
df['col2'] = np.r_[(a[:-N, None] < swv(a, N)[1:]  # broadcasted comparison
                    ).sum(axis=1),                # count True per row
                   -np.ones(N, dtype=int)]        # add missing -1
Output:
   col1  col2
0     4     1
1     2     3
2     8     0
3     4     2
4     3     3
5     7     1
6     6     1
7     9    -1
8     3    -1
9     5    -1
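Since the question asks for a function with a configurable window, the sliding_window_view approach could be wrapped roughly as follows. This is a minimal sketch: the function name add_count_larger_next and the parameters n, source and target are hypothetical, and the guard for frames with n rows or fewer is an added assumption about the desired behavior.

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view


def add_count_larger_next(df, n=3, source="col1", target="col2"):
    """Return a copy of df where `target` counts how many of the next n
    `source` values are strictly larger than the current one
    (-1 when fewer than n rows follow)."""
    values = df[source].to_numpy()
    # Start with -1 everywhere; rows that have n followers are overwritten.
    counts = np.full(len(values), -1, dtype=int)
    if n > 0 and len(values) > n:
        # windows[i] is values[i+1 : i+1+n], i.e. the n rows after row i.
        windows = sliding_window_view(values, n)[1:]
        counts[:-n] = (windows > values[:-n, None]).sum(axis=1)
    out = df.copy()  # do not modify the caller's dataframe
    out[target] = counts
    return out


df = pd.DataFrame({"col1": [4, 2, 8, 4, 3, 7, 6, 9, 3, 5]})
result = add_count_larger_next(df, n=3)
```

Returning a copy keeps the input unchanged; assigning to df[target] directly would work too if in-place modification is preferred.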