python, pandas

Delete all columns for which a value repeats consecutively more than 3 times


I have a DataFrame that looks like this:

date stock1 stock2 stock3 stock4 stock5 stock6 stock7 stock8 stock9 stock10
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.1 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.4 0.1
12/20 0.1 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.5 0.3 0.2 0.9 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1
10/20 0.1 0.2 0.3 0.4 0.5 0.7 0.7 0.8 0.9 0.9
11/20 0.8 0.9 0.3 0.4 0.3 0.7 0.3 0.2 0.4 0.1
12/20 0.3 0.6 0.9 0.5 0.6 0.7 0.8 0.7 0.9 0.1

I want to delete all columns in which the same value repeats, consecutively, more than 3 times. In this example, the columns "stock1", "stock6" and "stock9" should be deleted. The other columns also contain values that occur more than 3 times, but not one after the other. I think I can adapt the code from this question: Removing values that repeat more than 5 times in Pandas DataFrame, but I have not managed to do it yet.
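
For reference, the example frame above can be reconstructed with the snippet below (a minimal sketch; the values are copied column by column from the table, and only pandas is assumed):

    import pandas as pd

    df = pd.DataFrame({
        'date':    ['10/20', '11/20', '12/20'] * 3,
        'stock1':  [0.1, 0.1, 0.1, 0.1, 0.8, 0.3, 0.1, 0.8, 0.3],
        'stock2':  [0.2, 0.9, 0.6, 0.2, 0.9, 0.6, 0.2, 0.9, 0.6],
        'stock3':  [0.3, 0.3, 0.9, 0.3, 0.3, 0.9, 0.3, 0.3, 0.9],
        'stock4':  [0.4, 0.4, 0.5, 0.4, 0.4, 0.5, 0.4, 0.4, 0.5],
        'stock5':  [0.5, 0.3, 0.6, 0.5, 0.3, 0.6, 0.5, 0.3, 0.6],
        'stock6':  [0.6, 0.5, 0.7, 0.6, 0.5, 0.7, 0.7, 0.7, 0.7],
        'stock7':  [0.7, 0.3, 0.8, 0.7, 0.3, 0.8, 0.7, 0.3, 0.8],
        'stock8':  [0.8, 0.2, 0.7, 0.8, 0.2, 0.7, 0.8, 0.2, 0.7],
        'stock9':  [0.9, 0.4, 0.9, 0.9, 0.9, 0.9, 0.9, 0.4, 0.9],
        'stock10': [0.9, 0.1, 0.1, 0.9, 0.1, 0.1, 0.9, 0.1, 0.1],
    })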


Solution

  • You can set "date" aside as the index, then check whether each value differs from the previous one and use the cumulative sum of that comparison with groupby + cumcount.

    Then compute the max count per column; if it is greater than N-1, drop the column:

    df2 = df.set_index('date')
    N = 3
    # cumcount restarts at 0 on every change of value, so a run of k identical values peaks at k-1;
    # keep only the columns whose longest run spans at most N rows (max cumcount <= N-1)
    df2.loc[:, df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount()).max().le(N-1)]
    

    output:

            stock2   stock3   stock4   stock5   stock7   stock8   stock10
    date                                                                 
    10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
    11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
    12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
    10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
    11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
    12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
    10/20       0.2      0.3      0.4      0.5      0.7      0.8      0.9
    11/20       0.9      0.3      0.4      0.3      0.3      0.2      0.1
    12/20       0.6      0.9      0.5      0.6      0.8      0.7      0.1
    

    intermediate count of successive identical values:

    >>> df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
    
            stock1   stock2   stock3   stock4   stock5   stock6   stock7   stock8   stock9   stock10
    date                                                                                            
    10/20         0        0        0        0        0        0        0        0        0        0
    11/20         1        0        1        1        0        0        0        0        0        0
    12/20         2        0        0        0        0        0        0        0        0        1
    10/20         3        0        0        0        0        0        0        0        1        0
    11/20         0        0        1        1        0        0        0        0        2        0
    12/20         0        0        0        0        0        0        0        0        3        1
    10/20         0        0        0        0        0        1        0        0        4        0
    11/20         0        0        1        1        0        2        0        0        0        0
    12/20         0        0        0        0        0        3        0        0        0        1
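
    To make the final step explicit, these per-column counts are reduced with max and compared against N-1 to build the keep-mask. A small sketch, reusing df2 and N from above (the names counts, keep and result are only illustrative):

    counts = df2.apply(lambda c: c.groupby(c.ne(c.shift()).cumsum()).cumcount())
    keep = counts.max().le(N - 1)   # per-column max, e.g. stock1 -> 3, stock9 -> 4; True if <= N-1
    result = df2.loc[:, keep]       # drops stock1, stock6 and stock9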