
Remove all duplicate rows except first and last in pandas


I have a signal log with a lot of redundant data that I parse with pandas. To remove all duplicate rows except the first and last of each run, I use the following code:

>>> df = pd.DataFrame(
    {
        "A": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "B": [0, 0, 0, 0, 1, 1, 1, 2, 3],
    }
)
>>> df
   A  B
0  1  0
1  2  0
2  3  0
3  4  0
4  5  1
5  6  1
6  7  1
7  8  2
8  9  3
>>> df = df[~((df.B == df.B.shift()) & (df.B == df.B.shift(-1)))]
>>> df
   A  B
0  1  0
3  4  0
4  5  1
6  7  1
7  8  2
8  9  3

The log files can get pretty big (hundreds of megabytes), and the app runs on an AWS EC2 VPS with only 1 GB of RAM, so parsing a large file crashes the server. My questions are these:

  1. Does this method require 3 times the file's size in RAM (because it "creates" 3 DataFrames)?
  2. Is there a more memory-efficient way to do it?

Solution

  • You could try to break the logic into successive parts to avoid having everything in memory at once (a way to measure the actual peak usage, and a chunked variant for files that do not fit in RAM, are sketched after the output below). I would also avoid the ~ operation and compute the inverse directly with != and a boolean OR (using De Morgan's law):

    m = df['B'] != df['B'].shift()     # B differs from the previous row...
    m |= df['B'] != df['B'].shift(-1)  # ...or from the next row
    
    out = df[m]
    

    You could also check whether diff is more memory-efficient than shift:

    # diff yields NaN at the edges, which is != 0, so first/last rows are kept
    m = df['B'].diff().ne(0)     # non-zero change from the previous row...
    m |= df['B'].diff(-1).ne(0)  # ...or from the next row
    
    out = df[m]
    

    Output:

       A  B
    0  1  0
    3  4  0
    4  5  1
    6  7  1
    7  8  2
    8  9  3
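
    To put a number on question 1, it is easier to measure than to guess:
    tracemalloc reports the peak bytes allocated while a statement runs,
    and recent NumPy versions register their buffers with it. A rough
    sketch on synthetic data (the frame size and value range are
    assumptions, not taken from the question):

    import tracemalloc

    import numpy as np
    import pandas as pd

    # synthetic stand-in for the parsed log (sizes are an assumption)
    df = pd.DataFrame({"B": np.random.randint(0, 4, 5_000_000)})

    tracemalloc.start()
    out = df[~((df.B == df.B.shift()) & (df.B == df.B.shift(-1)))]
    print("chained:", tracemalloc.get_traced_memory()[1] / 1e6, "MB peak")
    tracemalloc.stop()

    tracemalloc.start()
    m = df["B"] != df["B"].shift()
    m |= df["B"] != df["B"].shift(-1)
    out = df[m]
    print("split:", tracemalloc.get_traced_memory()[1] / 1e6, "MB peak")
    tracemalloc.stop()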
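
    Regarding question 2: if even the split version does not fit in 1 GB,
    you can avoid loading the whole file at once and stream it in chunks,
    keeping only one chunk plus a two-row carry-over in memory. A minimal
    sketch, assuming the log parses as a CSV called signal.log with a B
    column (the file name, format and chunk size are assumptions):

    import pandas as pd

    pieces = []
    tail = None  # last two rows of the previous chunk: [decided row, deferred row]

    for chunk in pd.read_csv("signal.log", chunksize=100_000):
        if tail is not None:
            chunk = pd.concat([tail, chunk])
        b = chunk["B"]
        # keep rows whose B differs from the previous row or the next one
        m = b.ne(b.shift()) | b.ne(b.shift(-1))
        if tail is not None:
            m.iloc[0] = False  # already decided during the previous iteration
        m.iloc[-1] = False     # defer: its successor arrives with the next chunk
        pieces.append(chunk[m])
        tail = chunk.iloc[-2:]

    if tail is not None:               # non-empty file
        pieces.append(tail.iloc[-1:])  # the final row always closes a run
    out = pd.concat(pieces)

    The two-row carry-over is what keeps the chunk boundaries correct: the
    last row of a chunk cannot be decided until its successor arrives, so
    it is deferred, and the row before it rides along as context for the
    shift comparison. If the filtered result is itself large, append each
    piece to an output file instead of collecting them in a list.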