Tags: pandas, pandas-groupby, vectorization

Run groupby cumcount on multiple columns


I have a dataframe with multiple columns, each sorted in ascending order:

     40  41  42  43  44  45  46  47  48  49
0     1   1   1   1   1   1   1   1   1   1
1     1   1   1   1   1   1   1   1   1   1
2     1   1   1   1   1   2   1   1   1   1
3     1   1   1   1   1   2   1   1   1   1
4     1   1   1   1   1   2   2   1   1   1
..   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
367  18  26  25  25  30  25  27  27  30  29
368  18  26  26  25  30  25  27  27  31  29
369  18  27  27  25  30  25  27  27  31  29
370  19  27  27  25  30  25  27  27  31  29
371  19  27  27  25  30  25  27  27  31  29

I want to group by each column's values and run cumcount. I know I could iterate over all the columns, but as people say, you should avoid iteration as much as you can, so I would like to know if there is a more elegant solution.
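For reference, the column-by-column loop I would like to avoid looks roughly like this (using a small made-up DataFrame as a stand-in for the real data):

    import pandas as pd

    # small made-up DataFrame standing in for the real, sorted data
    df = pd.DataFrame({40: [1, 1, 1, 2, 2], 41: [1, 1, 2, 2, 3]})

    # per-column loop: group each column by its own values and cumcount
    out = pd.DataFrame(index=df.index, columns=df.columns)
    for col in df.columns:
        out[col] = df[col].groupby(df[col]).cumcount()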


Solution

  • If you have a reasonable number of columns, using apply on the columns is actually not that bad:

    # group each column by its own values and number the rows within each group
    df.apply(lambda c: c.groupby(c).cumcount())
    

    output:

         40  41  42  43  44  45  46  47  48  49
    0     0   0   0   0   0   0   0   0   0   0
    1     1   1   1   1   1   1   1   1   1   1
    2     2   2   2   2   2   0   2   2   2   2
    3     3   3   3   3   3   1   3   3   3   3
    4     4   4   4   4   4   2   0   4   4   4
    ..   ..  ..  ..  ..  ..  ..  ..  ..  ..  ..
    367   0   0   0   0   0   0   0   0   0   0
    368   1   1   0   1   1   1   1   1   0   1
    369   2   0   0   2   2   2   2   2   1   2
    370   0   1   1   3   3   3   3   3   2   3
    371   1   2   2   4   4   4   4   4   3   4
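    Since the question states that every column is already sorted in ascending order, equal values form contiguous runs, so a fully vectorized NumPy sketch is also possible. This is an alternative under that sorting assumption, not part of the apply-based answer above:

        import numpy as np
        import pandas as pd

        # row index as a column vector so it broadcasts across all columns
        idx = np.arange(len(df))[:, None]
        # True wherever a column's value differs from the previous row,
        # i.e. at the first row of each run (relies on the columns being sorted)
        new_run = df.ne(df.shift()).to_numpy()
        # for every row, the start index of the run it belongs to
        run_start = np.maximum.accumulate(np.where(new_run, idx, 0), axis=0)
        # position within the run equals cumcount when equal values are contiguous
        result = pd.DataFrame(idx - run_start, index=df.index, columns=df.columns)

    This avoids Python-level work per column entirely, but it only matches cumcount because the columns are sorted; with unsorted data it would count run lengths instead, so the apply version above is the safer general-purpose choice.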