I have a signal log with a lot of redundant data that I parse with pandas. To remove all duplicate rows except the first and last of each run, I use the following code:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {
...         "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
...         "B": [0, 0, 0, 0, 1, 1, 1, 2, 3],
...     }
... )
>>> df
A B
0 1 0
1 2 0
2 3 0
3 4 0
4 5 1
5 6 1
6 7 1
7 8 2
8 9 3
>>> df = df[~((df.B == df.B.shift()) & (df.B == df.B.shift(-1)))]
>>> df
A B
0 1 0
3 4 0
4 5 1
6 7 1
7 8 2
8 9 3
The log files can get pretty big, hundreds of megabytes, and the app runs on an AWS EC2 VPS with only 1 GB of RAM, so parsing a large file crashes the server. So my question is: how can I make this deduplication more memory-efficient?
You could try breaking the logic into successive parts to avoid holding everything in memory at once (a chunked sketch follows the snippet below). I would also avoid the ~ operation and compute the inverse directly with != and boolean OR (using De Morgan's law):
# keep rows whose B differs from the previous or the next row
m = df['B'] != df['B'].shift()
m |= df['B'] != df['B'].shift(-1)
out = df[m]
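For the chunking idea, here is a minimal sketch, assuming the log is a CSV file read with pd.read_csv and that B is the column of interest (the path, column name, and chunk size are placeholders). One undecided row is carried between chunks so that the comparisons at each chunk boundary still see both neighbours:

import pandas as pd

def drop_inner_duplicates(path, chunksize=100_000):
    pieces = []
    carry = None             # last row of the previous chunk, not yet decided
    carry_starts_run = True  # did that row start a new run of B values?
    for chunk in pd.read_csv(path, chunksize=chunksize):
        ext = chunk if carry is None else pd.concat([carry, chunk])
        prev_diff = ext['B'] != ext['B'].shift()  # differs from the row above
        prev_diff.iloc[0] = carry_starts_run      # fix the chunk boundary
        m = prev_diff | (ext['B'] != ext['B'].shift(-1))
        # the last row's "next" neighbour is in the following chunk,
        # so hold it back and decide it on the next iteration
        pieces.append(ext.iloc[:-1][m.iloc[:-1].to_numpy()])
        carry = ext.tail(1)
        carry_starts_run = bool(prev_diff.iloc[-1])
    if carry is not None:
        pieces.append(carry)  # the final row of the file is always kept
    return pd.concat(pieces, ignore_index=True)

If even the filtered result is too large to hold, each piece could be appended to an output file instead of collected in a list.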
You could also test whether diff is more memory-efficient than shift:
# the NaN produced by diff at the edges compares unequal to 0,
# so the first and last rows are always kept
m = df['B'].diff().ne(0)
m |= df['B'].diff(-1).ne(0)
out = df[m]
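Whether diff really allocates less than shift is worth measuring rather than guessing; a quick sketch with the standard library's tracemalloc, run against your real frame, could look like this:

import tracemalloc

def peak_mb(build_mask):
    # report the peak memory allocated while building one of the masks
    tracemalloc.start()
    build_mask()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak / 1e6

print(peak_mb(lambda: df['B'].ne(df['B'].shift()) | df['B'].ne(df['B'].shift(-1))))
print(peak_mb(lambda: df['B'].diff().ne(0) | df['B'].diff(-1).ne(0)))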
Output (the same for both versions of the mask):
A B
0 1 0
3 4 0
4 5 1
6 7 1
7 8 2
8 9 3
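If memory is still tight, the same mask can be built on the underlying numpy array, which skips the temporary Series that shift and diff create (a sketch, assuming B is numeric):

import numpy as np

b = df['B'].to_numpy()
m = np.ones(len(b), dtype=bool)  # first and last rows are always kept
m[1:-1] = (b[1:-1] != b[:-2]) | (b[1:-1] != b[2:])
out = df[m]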