
"Cumcount with Reset" and "Keep Last with Reset" in Python


I have a follow-up question related to this prior StackOverflow question.

Suppose I have the following NumPy array:

import numpy as np
v = np.array([
  0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
  3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
  1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
  0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
  2, 0, 0, 0, 0, 1])

I am trying to obtain a list of all the repeating-element sequences and their starting indices. My hunch is that using Pandas is the most straightforward way to achieve this.

Using the previously referenced StackOverflow answer, I write the following:

import pandas as pd

df = pd.DataFrame(v, columns=['digit'])
# a new run starts wherever the digit changes, so cumsum() labels each run
df["seq_len"] = df.groupby(
    (df["digit"] != df["digit"].shift()).cumsum()
)["digit"].cumcount() + 1

Producing the result:

    digit  seq_len
0       0        1
1       0        2
2       0        3
3       1        1
4       3        1
..    ...      ...
89      0        1
90      0        2
91      0        3
92      0        4
93      1        1

The last thing I need is to remove the duplicates along the "digit" column in such a way that the last "seq_len" value of each run is kept. Normally you could use pandas' duplicated or drop_duplicates, but these functions don't reset along the column: they treat repeated digits as duplicates across the whole frame rather than within each run.

What I don't want is:

>>> df.drop_duplicates(subset='digit', keep='last')
    digit  seq_len
22      3        3
88      2        3
92      0        4
93      1        1

What I do want is something like:

>>> magic_function(df)
    digit  seq_len
2       0        3
3       1        1
5       3        2
10      1        5
..    ...      ...
88      2        3
92      0        4
93      1        1

Of course, if I do "index - seq_len + 1", I can obtain the true starting indices, e.g.,

index    digit  seq_len
0            0        3
3            1        1
4            3        2
6            1        5
..         ...      ...
86           2        3
89           0        4
92           1        1
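
For concreteness, here's a tiny sketch of that index shift. The runs frame below is just a hypothetical three-row stand-in for the output of magic_function(df):

import pandas as pd

# hypothetical stand-in for the first three rows of magic_function(df)
runs = pd.DataFrame({"digit": [0, 1, 3], "seq_len": [3, 1, 2]}, index=[2, 3, 5])

# a run that ends at index i with length n starts at index i - n + 1
runs.index = runs.index.to_numpy() - runs["seq_len"].to_numpy() + 1
print(runs)
#    digit  seq_len
# 0      0        3
# 3      1        1
# 4      3        2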

So anyway, I'm looking for any advice on an efficient magic_function() to do the above. I appreciate all the help!


Solution

  • You can use GroupBy.apply with boolean indexing:

    # flag the first row of each run, then label runs with cumsum()
    m = df["digit"].ne(df["digit"].shift())
    grp = df.groupby(m.cumsum(), group_keys=False)

    # position within the current run, starting at 1
    df["seq_len"] = grp["digit"].cumcount().add(1)

    # inside a run every digit is identical, so dropping all but the "last"
    # duplicate keeps exactly one row per run (the one with the maximal seq_len)
    out = grp.apply(lambda g: g.loc[~g["digit"].duplicated(keep="last")])
    

    Output:

    print(out)
    
        digit  seq_len
    2       0        3
    3       1        1
    5       3        2
    10      1        5
    11      0        1
    ..    ...      ...
    83      0        7
    85      1        2
    88      2        3
    92      0        4
    93      1        1
    
    [43 rows x 2 columns]
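
As a side note (my own sketch, not part of the answer above): since every digit inside a run is identical, the last row of each run is simply the row whose next digit differs, so the same selection can be made without GroupBy.apply at all:

# True on the final row of every run; shift(-1) is NaN on the last row, so ne() keeps it too
last = df["digit"].ne(df["digit"].shift(-1))
out_fast = df.loc[last]

out_fast (a name chosen here only for illustration) holds the same rows as out, and it avoids the per-group Python call that apply makes.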