I have a follow up question related to this prior StackOverflow question.
Suppose I have the following NumPy array:
import numpy as np
v = np.array([
0, 0, 0, 1, 3, 3, 1, 1, 1, 1, 1, 0, 2, 3, 2, 1, 1, 0, 0, 1, 3, 3,
3, 2, 0, 0, 0, 0, 1, 1, 1, 2, 1, 0, 0, 0, 0, 0, 1, 2, 2, 1, 0, 0,
1, 1, 1, 0, 0, 0, 1, 2, 2, 1, 0, 0, 1, 1, 1, 1, 1, 2, 1, 1, 2, 0,
0, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 2,
2, 0, 0, 0, 0, 1])
I am trying to obtain a list of all the repeating-element sequences and their starting indices. My hunch is that using Pandas is the most straightforward way to achieve this.
Using the previously referenced StackOverflow answer, I write the following:
import pandas as pd
df = pd.DataFrame(v, columns=['digit'])
df["seq_len"] = df.groupby(
(df["digit"] != df["digit"].shift()).cumsum()
)["digit"].cumcount()+1
Producing the result:
digit seq_len
0 0 1
1 0 2
2 0 3
3 1 1
4 3 1
.. ... ...
89 0 1
90 0 2
91 0 3
92 0 4
93 1 1
The last thing I need is to remove the duplicates along the "digit" column in such a way that the last "seq_len" value is kept. Normally, you could use Pandas duplicated
or drop_duplicates
, however these functions don't do any resetting along the column.
What I don't want is:
>>> df.drop_duplicates(subset='digit', keep='last')
digit seq_len
22 3 3
88 2 3
92 0 4
93 1 1
What I do want is something like:
>>> magic_function(df)
digit seq_len
2 0 3
3 1 1
5 3 2
10 1 5
.. ... ...
88 2 3
92 0 4
93 1 1
Of course, if I do "index - seq_len + 1", I can obtain the true starting indices, e.g.,
index digit seq_len
0 0 3
3 1 1
4 3 2
6 1 5
.. ... ...
86 2 3
89 0 4
92 1 1
So anyways, looking for any advice on an efficient magic_function()
to do the above. Appreciate all the help!
You can use GroupBy.apply
with boolean indexing :
m = df["digit"].ne(df["digit"].shift())
grp = df.groupby(m.cumsum(), group_keys=False)
df["seq_len"] = grp["digit"].cumcount().add(1)
out = grp.apply(lambda g: g.loc[~g["digit"].duplicated(keep="last")])
Output :
print(out)
digit seq_len
2 0 3
3 1 1
5 3 2
10 1 5
11 0 1
.. ... ...
83 0 7
85 1 2
88 2 3
92 0 4
93 1 1
[43 rows x 2 columns]