Search code examples
rdateaggregatelubridatefeature-engineering

How can I make specific periods for each account separately starting from their first transaction date?


This is my transaction data:

id          from_id        to_id      amount    date_trx
<fctr>      <fctr>         <fctr>     <dbl>     <date>
0           7468           5695       700.0     2005-01-04
1           6213           9379       11832.0   2005-01-08
2           7517           8170       1000.0    2005-01-10
3           6143           9845       4276.0    2005-01-12
4           6254           9640       200.0     2005-01-14
5           6669           5815       200.0     2005-01-20
6           6934           8583       49752.0   2005-01-24
7           9240           8314       19961.0   2005-01-26
8           6374           8865       1000.0    2005-01-30
9           6143           6530       13.4      2005-01-31
...

I want to build new features based on time intervals.

Let's look at this:

id               from_id        to_id      amount       date_trx
<fctr>           <fctr>         <fctr>     <dbl>        <date>
149431           5370           5735       1000.0       2007-03-24
157403           5370           7058       3679.0       2007-04-13
158831           5370           8667       12600.0      2007-04-23
162680           5370           6053       19.2         2007-04-30
167082           5370           8165       3679.0       2007-05-13
168562           5370           5656       2100.0       2007-05-23
172578           5370           5929       79.0         2007-05-31
177507           5370           6725       3679.0       2007-06-01
179167           5370           8433       200.0        2007-06-22
183499           5370           7022       100.6        2007-06-29
...

Let's say, I want to calculate the amount of money transacted, for example, in week periods for each account.
So, starting from 2007-03-24, 5370's weekly transaction amount history is as follows:

in the 1st week(2007-03-24 - 2007-03-31):  1000.0          
in the 2nd week(2007-03-31 - 2007-04-07):  0.0          
in the 3rd week(2007-04-07 - 2007-04-14):  3679.0          
in the 4th week(2007-04-14 - 2007-04-21):  0.0          
in the 5th week(2007-04-21 - 2007-04-28):  12600.0          
in the 6th week(2007-04-28 - 2007-05-05):  19.2          
in the 7th week(2007-05-05 - 2007-05-12):  0.0          
in the 8th week(2007-05-12 - 2007-05-19):  3679.0          
in the 9th week(2007-05-19 - 2007-05-26):  2100.0          
in the 10th week(2007-05-26 - 2007-06-02):  79.0 + 3679.0 = 3758.0          
in the 11th week(2007-06-02 - 2007-06-09):  0.0          
in the 12th week(2007-06-09 - 2007-06-16):  0.0          
in the 13th week(2007-06-16 - 2007-06-23):  200.0          
in the 14th week(2007-06-23 - 2007-06-30):  100.6          

Here we see that the max amount 5370 transacted in a week period is 12600.0. So, now I want to see this measure as a feature, say max_of_weekly_transacted_amount.
Similarly, I want to calculate say the mean amount transacted in a month period for each account and store it as another feature, say mean_of_monthly_transacted_amount

I tried lubridate function floor_date:

# Max of weekly transaction amount
data <- data %>% group_by(date_trx_week=floor_date(date_trx, "week"),from_id) %>% mutate(weekly_trx = sum(amount)) %>% 
   group_by(from_id) %>% mutate(max_of_weekly_transacted_amount=max(weekly_trx))%>% 
   select(-c(date_trx_week,weekly_trx))


# Mean of monthly transaction amount
data <- data %>% group_by(date_trx_month=floor_date(date_trx, "month"),from_id) %>% mutate(monthly_trx = sum(amount)) %>% 
   group_by(from_id) %>% mutate(mean_of_monthly_transacted_amount=mean(monthly_trx))%>% 
   select(-c(date_trx_month,monthly_trx))

The date variable date_trx in my data starts with 2005-01-01 and ends with 2010-12-31. floor_date starts week periods with 2005-01-02-2005-01-09 and continues with 2005-01-09-2005-01-16 and so on. It starts month periods with 2005-01-01-2005-02-01 and continues with 2005-02-01-2005-03-01 and so on. And this function uses the same periods for each account.

But I want to make periods spesifically for each account based on the first transaction date they made. So, for from_id = 5370 first transaction date is 2007-03-24. If I want to make week periods for 5370, it would be 2007-03-24 - 2007-03-31, 2007-03-31 - 2007-04-07, and so on. If I want to make month periods for 5370, it would be 2007-03-24 - 2007-04-24, 2007-04-24 - 2007-05-24, and so on.
For another account, periods would be different. So, how can I achieve this? How can I make specific periods for each account separately starting from their first transaction date?


Solution

  • The following is all in Base-R. First, the custom weeks. This uses mod 7 on the difference between the first entry and the rest. Hopefully someone can use this approach and make you a more efficient data.table solution.

    df$Week <- unsplit(tapply(df$date_trx,df$from_id, function(x) as.numeric((x-x[1])) %/% 7 ),df$from_id)
    

    Then if you want to take say the mean by ID by Week you can

    aggregate(amount ~ from_id + Week, df, mean)
    
       from_id Week  amount
    1     5370    0  1000.0
    2     6143    0  4276.0
    3     6213    0 11832.0
    4     6254    0   200.0
    5     6374    0  1000.0
    6     6669    0   200.0
    7     6934    0 49752.0
    8     7468    0   700.0
    9     7517    0  1000.0
    10    9240    0 19961.0
    11    5370    2  3679.0
    12    6143    2    13.4
    13    5370    4 12600.0
    14    5370    5    19.2
    15    5370    7  3679.0
    16    5370    8  2100.0
    17    5370    9  1879.0
    18    5370   12   200.0
    19    5370   13   100.6
    

    Of course you can replace mean with max or any other function.