I have a dataframe:
df = c1 c2 c3 code
1. 2. 3. 200
1. 5. 7. 220
1. 2. 3. 200
2. 4. 1. 340
6. 1. 1. 370
6. 1. 5. 270
9. 8. 2. 300
1. 6. 9. 700
9. 2. 1. 200
8. 1. 2 400
1. 2 1. 200
2. 5. 3 900
8. 0. 4. 300
9. 1. 2. 620
I want to take only the rows that are between any row with 300 code to its previous 200 code. So here I will have
df. c1 c2 c3 code batch_num
1. 2. 3. 200. 0
2. 4. 1. 340. 0
6. 1. 1. 370. 0
6. 1. 5. 270. 0
9. 8. 2. 300. 0
1. 2 1. 200. 1
2. 5. 3 900. 1
8. 0. 4. 300. 1
So basically what I need is to: find each 300, and for each - find the nearest previous 200, and take the rows between them. It is guaranteed that there will always be at least one 200 before each 300. Than, add a columns that indicate the proper batch. How can I do it efficiently in pandas?
You can use:
# rows after 200 are True
m1 = df['code'].map({200: True, 300: False}).ffill()
# rows before 300 are True
m2 = df['code'].map({300: True, 200: False}).bfill()
# if both conditions True, expand to ± 1
m = (m1&m2).rolling(3, min_periods=1, center=True).max().astype(bool)
# select
out = df[m]
# add batch
out['batch_num'] = out['code'].eq(200).cumsum().sub(1)
output:
c1 c2 c3 code batch_num
2 1.0 2.0 3.0 200 0
3 2.0 4.0 1.0 340 0
4 9.0 8.0 2.0 300 0
8 1.0 2.0 1.0 200 1
9 2.0 5.0 3.0 900 1
10 8.0 0.0 4.0 300 1
Intermediates:
c1 c2 c3 code m1 m2 m1&m2 rolling_max
0 1.0 2.0 3.0 200 True False False False
1 1.0 5.0 7.0 220 True False False False
2 1.0 2.0 3.0 200 True False False True
3 2.0 4.0 1.0 340 True True True True
4 9.0 8.0 2.0 300 False True False True
5 1.0 6.0 9.0 700 False False False False
6 9.0 2.0 1.0 200 True False False False
7 8.0 1.0 2.0 400 True False False False
8 1.0 2.0 1.0 200 True False False True
9 2.0 5.0 3.0 900 True True True True
10 8.0 0.0 4.0 300 False True False True
11 9.0 1.0 2.0 620 False NaN False False
m = m1&m2
m = m|m.shift()|m.shift(-1)
out = df[m]
out['batch_num'] = out['code'].eq(200).cumsum().sub(1)