I have a very large dataframe (about 5 billion rows and 13 columns). One of the columns can have values of 0 and 1, representing inactive and active mode. There are usually several rows of inactive mode (0), followed by rows of active mode (1). So first there is an inactive segment, followed by an active segment, again followed by an inactive segment and so forth.
Something like this:
df = pd.DataFrame(np.random.randint(0,100,size=(30, 4)), columns=list('ABCD'))
df['E'] = [0,0,0,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0]
A B C D E
0 53 27 10 49 0
1 39 46 89 33 0
2 32 43 7 15 0
3 16 36 33 18 0
4 12 3 94 96 1
5 31 22 64 51 1
6 79 93 93 67 1
7 85 27 10 19 1
8 4 92 38 38 1
9 79 8 6 90 1
10 68 6 26 24 1
11 89 49 23 91 1
12 44 17 40 75 1
13 12 18 17 87 1
14 76 67 74 15 0
15 9 37 8 92 0
16 52 53 93 47 0
17 74 3 27 96 0
18 77 10 93 8 0
19 64 33 46 79 0
20 72 9 36 99 0
21 75 13 76 49 1
22 95 21 88 77 1
23 46 45 36 20 1
24 61 26 72 48 1
25 35 62 98 28 1
26 91 43 23 42 0
27 25 59 6 30 0
28 20 31 2 76 0
29 2 25 86 12 0
I need to delete the first 20% of every active segment, the inactive segments should stay as they are. So it should look like this:
A B C D E
0 53 27 10 49 0
1 39 46 89 33 0
2 32 43 7 15 0
3 16 36 33 18 0
6 79 93 93 67 1
7 85 27 10 19 1
8 4 92 38 38 1
9 79 8 6 90 1
10 68 6 26 24 1
11 89 49 23 91 1
12 44 17 40 75 1
13 12 18 17 87 1
14 76 67 74 15 0
15 9 37 8 92 0
16 52 53 93 47 0
17 74 3 27 96 0
18 77 10 93 8 0
19 64 33 46 79 0
20 72 9 36 99 0
22 95 21 88 77 1
23 46 45 36 20 1
24 61 26 72 48 1
25 35 62 98 28 1
26 91 43 23 42 0
27 25 59 6 30 0
28 20 31 2 76 0
29 2 25 86 12 0
I already added an additional column using df.diff() that shows where the change from inactive to active and vice versa occurs:
A B C D E E_diff
0 53 27 10 49 0 NaN
1 39 46 89 33 0 0.0
2 32 43 7 15 0 0.0
3 16 36 33 18 0 0.0
4 12 3 94 96 1 1.0
5 31 22 64 51 1 0.0
6 79 93 93 67 1 0.0
7 85 27 10 19 1 0.0
8 4 92 38 38 1 0.0
9 79 8 6 90 1 0.0
10 68 6 26 24 1 0.0
11 89 49 23 91 1 0.0
12 44 17 40 75 1 0.0
13 12 18 17 87 1 0.0
14 76 67 74 15 0 -1.0
15 9 37 8 92 0 0.0
16 52 53 93 47 0 0.0
17 74 3 27 96 0 0.0
18 77 10 93 8 0 0.0
19 64 33 46 79 0 0.0
20 72 9 36 99 0 0.0
21 75 13 76 49 1 1.0
22 95 21 88 77 1 0.0
23 46 45 36 20 1 0.0
24 61 26 72 48 1 0.0
25 35 62 98 28 1 0.0
26 91 43 23 42 0 -1.0
27 25 59 6 30 0 0.0
28 20 31 2 76 0 0.0
29 2 25 86 12 0 0.0
I can't figure out how to determine the length of the segments and then delete the correct rows. Do I need to write a loop for that or is there an easier way? And how can I determine the correct index values for the rows to drop?
This does the job. It's a patchwork of two separate StackOverflow answers: stackoverflow.com/a/45886579/7483211 and https://stackoverflow.com/a/31226290/7483211:
df['flag'] = df.E.diff().ne(0).cumsum()
def mask_first(x,ratio=0.2):
result = np.ones_like(x,dtype=bool)
result[0:math.ceil(len(x)*ratio)] = False
return result
mask = df.groupby(['flag'])['flag'].transform(mask_first).astype(bool)
df[(df.E == 0) | mask].drop('flag',axis=1)
The output is:
idx A B C D E
0 77 44 52 17 0
1 75 61 69 6 0
2 79 87 99 31 0
3 42 10 3 98 0
6 53 68 86 92 1
7 25 3 27 84 1
8 18 60 93 79 1
...