Search code examples
python-3.xpandasdataframedatetimedata-analysis

Pandas count monthly occurrences with across-rows condition


I have a Dataframe like this

                              oper_status
2012-01-01 00:26:54.250            0
2012-01-01 12:11:54.250            1
2012-01-01 13:57:54.250            2
2012-01-02 00:16:54.250            0
2012-01-02 14:26:54.250            1
2012-01-02 17:20:54.250            0
2012-01-04 08:21:54.250            0
2012-01-04 15:34:54.250            1
2012-01-04 19:45:54.250            0
2012-01-05 01:00:54.250            0
2012-01-05 12:46:54.250            1
2012-01-05 20:27:54.250            2
        (...)                    (...)

and I want to count how many times every month I have consecutive values with this pattern: 0,1,2. I tried looping on the rows with iterrows() but it's very slow since I have a big dataset. I also thought about using "diff" but I can't figure out a simple way to do it. Thanks

EDIT: The expected output is like this

              count
time                      
2012-03-31     244
2012-04-30     65
2012-05-31     167
2012-06-30     33
2012-07-31     187
            ...     ...
2013-05-31     113
2013-06-30     168
2013-07-31     294
2013-08-31     178
2013-09-30     65

Solution

  • Counting sequential patterns is a two step process. First, build a sequence for each row, representing the pattern ending at that row:

    df['seq'] = df.order_status.astype(str).shift(periods=0) + '-' + 
                df.order_status.astype(str).shift(periods=1) + '-' + 
                df.order_status.astype(str).shift(periods=2)
    
                          date  order_status    seq
    0  2012-01-01 00:26:54.250             0    NaN
    1  2012-01-01 12:11:54.250             1    NaN
    2  2012-01-01 13:57:54.250             2  2-1-0
    3  2012-01-02 00:16:54.250             0  0-2-1
    4  2012-01-02 14:26:54.250             1  1-0-2
    5  2012-01-02 17:20:54.250             0  0-1-0
    6  2012-01-04 08:21:54.250             0  0-0-1
    7  2012-01-04 15:34:54.250             1  1-0-0
    8  2012-01-04 19:45:54.250             0  0-1-0
    9  2012-01-05 01:00:54.250             0  0-0-1
    10 2012-01-05 12:46:54.250             1  1-0-0
    11 2012-01-05 20:27:54.250             2  2-1-0
    

    Then, filter down to only the correct sequences and aggregate up to your desired level:

    df['month'] = df.date.dt.month    
    df[df.seq == '2-1-0'].groupby("month").month.count()
    
    month
    1    2
    

    Alter as necessary to handle cases where you want patterns starting in a certain period, stopping there, entirely within, etc...