Tags: python, pandas, group-by, sequence

Pandas: sequence data over date based on column change


I have time-series data spanning multiple days that I need to sequence (i.e., increment a sequence counter each time the value changes). I have the sequencing working without groupby, but I'm a little lost on how to apply the same or similar code to the grouped data.

My data looks like:

index   timestamp       value
0       1684713605000   1
1       1684713610000   1
2       1684713611000   1
3       1684713614000   0
4       1684713615000   0
5       1684713616000   0
6       1684713619000   1
7       1684713620000   1
8       1684713621000   1
9       1684832896000   1
10      1684832897000   1
11      1684832898000   1
12      1684832901000   0
13      1684832902000   0
14      1684832903000   0
15      1684832906000   1
16      1684832907000   1
17      1684832908000   1

My timestamp column is not guaranteed to be sequential, but there is generally one record per second of the day. I need the sequence column to increment until the end of the day, then begin counting again at 0 on the next day.

The code I'm using to sequence is:

subset = df[["value"]].copy()

# flag rows where the value differs from the previous row
# (the first row always compares unequal to the NaN from shift)
subset["change"] = (subset["value"].shift() != subset["value"]).astype(int)

# running count of changes, shifted so the sequence starts at 0
subset["seq"] = subset["change"].cumsum() - 1

df["seq"] = subset["seq"]

I've been able to create the groups with:

subset = df[["timestamp", "value"]].copy()
# convert epoch milliseconds to a calendar date for grouping
subset["date"] = pd.to_datetime(subset["timestamp"], unit="ms", origin="unix").dt.date
g = subset.groupby("date")
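
Continuing from that snippet, a quick check (the dates come from converting the sample timestamps) confirms the two day groups:

print(g.size())
# date
# 2023-05-22    9
# 2023-05-23    9
# dtype: int64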

but I'm not sure how to proceed. My desired result is a sequence column that increments every time the value changes but resets at the start of each new day:

index   timestamp       value   seq
0       1684713605000   1       0
1       1684713610000   1       0
2       1684713611000   1       0
3       1684713614000   0       1
4       1684713615000   0       1
5       1684713616000   0       1
6       1684713619000   1       2
7       1684713620000   1       2
8       1684713621000   1       2
9       1684832896000   1       0    <-- first record of a new day
10      1684832897000   1       0
11      1684832898000   1       0
12      1684832901000   0       1
13      1684832902000   0       1
14      1684832903000   0       1
15      1684832906000   1       2
16      1684832907000   1       2
17      1684832908000   1       2

Solution

  • Try:

    t = pd.to_datetime(df["timestamp"], unit="ms", origin="unix").dt.date

    # within each day, flag rows where the value differs from the previous
    # row (the first row of each day always counts as a change) and take a
    # running total; group_keys=False keeps the original index so the
    # result aligns with df
    df["seq"] = df.groupby(t, group_keys=False)["value"].apply(
        lambda x: x.ne(x.shift()).cumsum() - 1
    )

    print(df)
    

    Prints:

        index      timestamp  value  seq
    0       0  1684713605000      1    0
    1       1  1684713610000      1    0
    2       2  1684713611000      1    0
    3       3  1684713614000      0    1
    4       4  1684713615000      0    1
    5       5  1684713616000      0    1
    6       6  1684713619000      1    2
    7       7  1684713620000      1    2
    8       8  1684713621000      1    2
    9       9  1684832896000      1    0
    10     10  1684832897000      1    0
    11     11  1684832898000      1    0
    12     12  1684832901000      0    1
    13     13  1684832902000      0    1
    14     14  1684832903000      0    1
    15     15  1684832906000      1    2
    16     16  1684832907000      1    2
    17     17  1684832908000      1    2
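
  • Alternatively, the same result is possible without apply: flag a change whenever the value differs from the previous row or the day rolls over, then take a plain per-day cumulative sum. A minimal sketch of that idea, assuming the same df as above:

    day = pd.to_datetime(df["timestamp"], unit="ms").dt.date

    # a run starts when the value changes or when a new day begins
    change = df["value"].ne(df["value"].shift()) | day.ne(day.shift())

    # per-day running count of runs, shifted so each day starts at 0
    df["seq"] = change.astype(int).groupby(day).cumsum() - 1

    Both versions rely on the first row of each day counting as the start of a new run, which is what resets the counter to 0 at the day boundary.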