Search code examples
pandastime-seriescounting

Count different actions within one hour in python


I am starting to work with time series. I have one of a user doing bank transfers to different countries, however the most frequent country to where he/she is doing the transfers is X, but there are transfers also to the countries Y and Z. Let's say:

date                           id       country
2020-01-01T00:00:00.000Z       id_01     X
2020-01-01T00:20:00.000Z       id_02     X
2020-01-01T00:25:00.000Z       id_03     Y
2020-01-01T00:35:00.000Z       id_04     X
2020-01-01T00:45:00.000Z       id_05     Z
2020-01-01T01:00:00.000Z       id_06     X
2020-01-01T10:20:00.000Z       id_07     X
2020-01-01T10:25:00.000Z       id_08     X
2020-01-01T13:00:00.000Z       id_09     X
2020-01-01T18:45:00.000Z       id_10     Z
2020-01-01T18:55:00.000Z       id_11     X

Since the most frequent country is X, I would like to count iteratively how many transactions have been done within one hour (in the whole list of events) to countries different than X.

The format of the expected output for this particular case would be:

date                           id        country
2020-01-01T00:25:00.000Z       id_03     Y
2020-01-01T00:45:00.000Z       id_05     Z

Starting from 2020-01-01T00:00:00.000Z, within one hour there are two Y, Z transactions. Then starting from 2020-01-01T00:20:00.000Z, within one hour, there are the same transactions, and so on. Then, starting from 2020-01-01T10:20:00.000Z, within one hour, all are X. Starting from 2020-01-01T18:45:00.000Z, within one hour, there is only one Z.

I am trying with a double for loop and .value_counts(), but I'm not sure of what I am doing.


Solution

  • IIUC, you can select only the rows not X, then use diff once forward and once backward (within 1 hour before and after) and you want where any of the two diff is below a Timedelta of 1h.

    #convert to datetime
    df['date'] = pd.to_datetime(df['date'])
    
    #mask not X and select only these rows
    mX = df['country'].ne('X')
    df_ = df[mX].copy()
    
    # mask within an hour before and after 
    m1H = (df_['date'].diff().le(pd.Timedelta(hours=1)) | 
            df_['date'].diff(-1).le(pd.Timedelta(hours=1)) )
    
    # selet only the rows meeting criteria on X and 1H
    df_ = df_[m1H]
    print (df_)
                           date     id country
    2 2020-01-01 00:25:00+00:00  id_03       Y
    4 2020-01-01 00:45:00+00:00  id_05       Z