Say I have the following sample dataframe (there are about 25k rows in the real dataframe)
df = pd.DataFrame({'A' : [0,3,2,9,1,0,4,7,3,2], 'B': [9,8,3,5,5,5,5,8,0,4]})
df
A B
0 0 9
1 3 8
2 2 3
3 9 5
4 1 5
5 0 5
6 4 5
7 7 8
8 3 0
9 2 4
For the column A I need to know how many next and previous rows are greater than current row value but less than value in column B.
So my expected output is :
A B next count previous count
0 9 2 0
3 8 0 0
2 3 0 1
9 5 0 0
1 5 0 0
0 5 2 1
4 5 1 0
7 8 0 0
3 0 0 2
2 4 0 0
Explanation :
Similarly, previous count
is calculated
Note : I know how to solve this problem by looping using list comprehension or using the pandas apply method but still I won't mind a clear and concise apply
approach. I was looking for a more pandaic
approach.
Here is the apply solution, which I think is inefficient. Also, as people said that there might be no vector
solution for the question. So as mentioned, a more efficient apply solution will be accepted for this question.
This is what I have tried.
This function gets the number of previous/next rows that satisfy the condition.
def get_prev_next_count(row):
next_nrow = df.loc[row['index']+1:,['A', 'B']]
prev_nrow = df.loc[:row['index']-1,['A', 'B']][::-1]
if (next_nrow.size == 0):
return 0, ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin()
if (prev_nrow.size == 0):
return ((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), 0
return (((next_nrow.A > row.A) & (next_nrow.A < next_nrow.B)).argmin(), ((prev_nrow.A > row.A) & (prev_nrow.A < prev_nrow.B)).argmin())
df[['next count', 'previous count']] = df.reset_index().apply(get_prev_next_count, axis=1, result_type="expand")
This gives us the expected output
df
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
I made some optimizations:
reset_index()
you can access the index with .name
df[['A']]
instead of the whole frame, that may help.prev_nrow.empty
is the same as (prev_nrow.size == 0)
first_false
, this speeds things up significantly.def first_false(val1, val2, A):
i = 0
for x, y in zip(val1, val2):
if A < x < y:
i += 1
else:
break
return i
def get_prev_next_count(row):
A = row['A']
next_nrow = df.loc[row.name+1:,['A', 'B']]
prev_nrow = df2.loc[row.name-1:,['A', 'B']]
if next_nrow.empty:
return 0, first_false(prev_nrow.A, prev_nrow.B, A)
if prev_nrow.empty:
return first_false(next_nrow.A, next_nrow.B, A), 0
return (first_false(next_nrow.A, next_nrow.B, A),
first_false(prev_nrow.A, prev_nrow.B, A))
df2 = df[::-1].copy() # Shave a tiny bit of time by only reversing it once~
df[['next count', 'previous count']] = df[['A']].apply(get_prev_next_count, axis=1, result_type='expand')
print(df)
Output:
A B next count previous count
0 0 9 2 0
1 3 8 0 0
2 2 3 0 1
3 9 5 0 0
4 1 5 0 0
5 0 5 2 1
6 4 5 1 0
7 7 8 0 0
8 3 0 0 2
9 2 4 0 0
Expanding the data:
df = pd.concat([df]*(10000//4), ignore_index=True)
# df.shape == (25000, 2)
Original Method:
New Method:
Throw pandarallel
at it:
from pandarallel import pandarallel
pandarallel.initialize()
df[['A']].parallel_apply(get_prev_next_count, axis=1, result_type='expand')