Search code examples
pythonpandasrolling-computation

how to count occurrences of specific string in previous x rows


I have a list of activities and the approximate timestamp they occur in. I would like to count the occurences of a string in the previous 'x' rows (walking or running etc.) and add it to the dataframe. Pandas DataFrame does not support rolling (for non-numeric data) and I'm not sure if I can use shift to check like the previous 30, 50 or even 70 rows of data. I haven't made any concrete progress yet as I have been looking for similar questions/solutions on the site.

    timestamp   event
0   2021-12-18 18:20:25+08:00   running
1   2021-12-18 18:20:27+08:00   running
2   2021-12-18 18:20:29+08:00   walking
3   2021-12-18 18:20:31+08:00   walking
4   2021-12-18 18:20:33+08:00   walking
5   2021-12-18 18:20:35+08:00   walking
6   2021-12-18 18:20:37+08:00   walking
7   2021-12-18 18:20:39+08:00   walking
8   2021-12-18 18:20:41+08:00   stationary
9   2021-12-18 18:20:43+08:00   stationary
10  2021-12-18 18:20:45+08:00   stationary
11  2021-12-18 18:20:47+08:00   stationary

df.loc[:, 'Count previous K'] = 0  # new column to count previous row activities

expected output:

    timestamp   event   Count previous K
0   2021-12-18 18:20:25+08:00   running 0
1   2021-12-18 18:20:27+08:00   running 0
2   2021-12-18 18:20:29+08:00   walking 1
3   2021-12-18 18:20:31+08:00   walking 2
4   2021-12-18 18:20:33+08:00   walking 3
5   2021-12-18 18:20:35+08:00   walking 4
6   2021-12-18 18:20:37+08:00   walking 5
7   2021-12-18 18:20:39+08:00   walking 6
8   2021-12-18 18:20:41+08:00   stationary  6
9   2021-12-18 18:20:43+08:00   stationary  6
10  2021-12-18 18:20:45+08:00   stationary  6
11  2021-12-18 18:20:47+08:00   stationary  6
12  2021-12-18 18:20:49+08:00   stationary  5

for a window size of 10 (including current index/row) counting occurences of walking.


Solution

  • You can use a boolean to see when a particular event is occurring, then perform a rolling sum on the boolean series. As @mozway pointed out, the argument min_periods=1 will avoid NaN appearing at the beginning of the resulting DataFrame:

    df['walking_count'] = (df['event'] == 'walking').rolling(5, min_periods=1).sum()
    

    This sets a new column 'walking_count' to the following series:

    0     0.0
    1     0.0
    2     1.0
    3     2.0
    4     3.0
    5     4.0
    6     5.0
    7     5.0
    8     4.0
    9     3.0
    10    2.0
    11    1.0