Pandas: Replace/Change Duplicate Values Within a Time Range


I have a pandas DataFrame in which I want to replace duplicate values with 0 (I don't want to delete them) when they occur within a certain range of days.

In the example below, I want to replace duplicate values in every column with 0 within a window of, say, 3 days (the number should be configurable). The desired result is also shown below.

              A   B  C

01-01-2011   2   10  0
01-02-2011   2   12  2
01-03-2011   2   10  0
01-04-2011   3   11  3
01-05-2011   5   15  0
01-06-2011   5   23  1
01-07-2011   4   21  4
01-08-2011   2   21  5
01-09-2011   1   11  0
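
For reproducibility, the table above can be built roughly like this. It is a minimal sketch: the dates are kept as plain string labels, which is an assumption, since the real index may well be a DatetimeIndex.

```python
import pandas as pd

# Example data copied from the table above.
# Assumption: the dates are plain string labels, not parsed datetimes.
df = pd.DataFrame(
    {
        "A": [2, 2, 2, 3, 5, 5, 4, 2, 1],
        "B": [10, 12, 10, 11, 15, 23, 21, 21, 11],
        "C": [0, 2, 0, 3, 0, 1, 4, 5, 0],
    },
    index=[
        "01-01-2011", "01-02-2011", "01-03-2011",
        "01-04-2011", "01-05-2011", "01-06-2011",
        "01-07-2011", "01-08-2011", "01-09-2011",
    ],
)
```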

The output should look like this:

              A   B  C

01-01-2011   2   10  0
01-02-2011   0   12  2
01-03-2011   0   0   0
01-04-2011   3   11  3
01-05-2011   5   15  0
01-06-2011   0   23  1
01-07-2011   4   21  4
01-08-2011   2   0   5
01-09-2011   1   11  0

Any help will be appreciated.


Solution

  • You can use df.shift() to look at the value a given number of rows above or below the current row (the number x in .shift(x) sets how many rows).

    Combine that with .loc to select every row whose value is identical to the value 1 or 2 rows above it, and then replace that value with 0.

    Something like this should work (the code has been edited so that it handles any number of columns and a configurable number of days):

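    numberOfDays = 3  # window size in rows, current row included
    
    # shift() moves by rows, not calendar days, so this assumes one row per day
    for col in df.columns:
        # compare each value with the values 1 .. numberOfDays - 1 rows above it
        for x in range(1, numberOfDays):
            df.loc[df[col] == df[col].shift(x), col] = 0
    
    print(df)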
    

    This gives me the output:

                A   B  C
    date
    01-01-2011  2  10  0
    01-02-2011  0  12  2
    01-03-2011  0   0  0
    01-04-2011  3  11  3
    01-05-2011  5  15  0
    01-06-2011  0  23  1
    01-07-2011  4  21  4
    01-08-2011  2   0  5
    01-09-2011  1  11  0
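
One caveat: the loop above overwrites df as it goes, so later comparisons see the zeros that were just written rather than the original values. If every comparison should use the original data instead, a non-mutating variant could look like the sketch below. The names zero_recent_duplicates and window are hypothetical helpers for illustration, not part of pandas.

```python
import pandas as pd


def zero_recent_duplicates(frame, window):
    """Return a copy of `frame` in which a value becomes 0 when it equals
    any of the previous `window - 1` rows of the ORIGINAL column."""
    mask = pd.DataFrame(False, index=frame.index, columns=frame.columns)
    for lag in range(1, window):
        # `frame` is never modified, so each comparison sees the original values
        mask |= frame.eq(frame.shift(lag))
    return frame.mask(mask, 0)


# Small input on which the two approaches differ
demo = pd.DataFrame({"A": [5, 5, 3, 5]})
result = zero_recent_duplicates(demo, window=3)
print(result)
```

On this small input the last 5 is zeroed as well, because it matches the original 5 two rows above it, while the in-place loop would leave it untouched (that 5 has already been overwritten with 0 by then). On the example data from the question, both approaches produce the same table.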