This is my transaction data:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
0 7468 5695 700.0 2005-01-04
1 6213 9379 11832.0 2005-01-08
2 7517 8170 1000.0 2005-01-10
3 6143 9845 4276.0 2005-01-12
4 6254 9640 200.0 2005-01-14
5 6669 5815 200.0 2005-01-20
6 6934 8583 49752.0 2005-01-24
7 9240 8314 19961.0 2005-01-26
8 6374 8865 1000.0 2005-01-30
9 6143 6530 13.4 2005-01-31
...
I want to build new features based on time intervals.
Let's look at this:
id from_id to_id amount date_trx
<fctr> <fctr> <fctr> <dbl> <date>
149431 5370 5735 1000.0 2007-03-24
157403 5370 7058 3679.0 2007-04-13
158831 5370 8667 12600.0 2007-04-23
162680 5370 6053 19.2 2007-04-30
167082 5370 8165 3679.0 2007-05-13
168562 5370 5656 2100.0 2007-05-23
172578 5370 5929 79.0 2007-05-31
177507 5370 6725 3679.0 2007-06-01
179167 5370 8433 200.0 2007-06-22
183499 5370 7022 100.6 2007-06-29
...
Let's say, I want to calculate the amount of money transacted, for example, in week periods for each account.
So, starting from 2007-03-24
, 5370
's weekly transaction amount history is as follows:
in the 1st week(2007-03-24 - 2007-03-31): 1000.0
in the 2nd week(2007-03-31 - 2007-04-07): 0.0
in the 3rd week(2007-04-07 - 2007-04-14): 3679.0
in the 4th week(2007-04-14 - 2007-04-21): 0.0
in the 5th week(2007-04-21 - 2007-04-28): 12600.0
in the 6th week(2007-04-28 - 2007-05-05): 19.2
in the 7th week(2007-05-05 - 2007-05-12): 0.0
in the 8th week(2007-05-12 - 2007-05-19): 3679.0
in the 9th week(2007-05-19 - 2007-05-26): 2100.0
in the 10th week(2007-05-26 - 2007-06-02): 79.0 + 3679.0 = 3758.0
in the 11th week(2007-06-02 - 2007-06-09): 0.0
in the 12th week(2007-06-09 - 2007-06-16): 0.0
in the 13th week(2007-06-16 - 2007-06-23): 200.0
in the 14th week(2007-06-23 - 2007-06-30): 100.6
Here we see that the max amount 5370
transacted in a week period is 12600.0
. So, now I want to see this measure as a feature, say max_of_weekly_transacted_amount
.
Similarly, I want to calculate say the mean amount transacted in a month period for each account and store it as another feature, say mean_of_monthly_transacted_amount
I tried lubridate function floor_date
:
# Max of weekly transaction amount
data <- data %>% group_by(date_trx_week=floor_date(date_trx, "week"),from_id) %>% mutate(weekly_trx = sum(amount)) %>%
group_by(from_id) %>% mutate(max_of_weekly_transacted_amount=max(weekly_trx))%>%
select(-c(date_trx_week,weekly_trx))
# Mean of monthly transaction amount
data <- data %>% group_by(date_trx_month=floor_date(date_trx, "month"),from_id) %>% mutate(monthly_trx = sum(amount)) %>%
group_by(from_id) %>% mutate(mean_of_monthly_transacted_amount=mean(monthly_trx))%>%
select(-c(date_trx_month,monthly_trx))
The date variable date_trx
in my data starts with 2005-01-01
and ends with 2010-12-31
. floor_date
starts week periods with 2005-01-02-2005-01-09
and continues with 2005-01-09-2005-01-16
and so on. It starts month periods with 2005-01-01-2005-02-01
and continues with 2005-02-01-2005-03-01
and so on. And this function uses the same periods for each account.
But I want to make periods spesifically for each account based on the first transaction date they made. So, for from_id = 5370
first transaction date is 2007-03-24
. If I want to make week periods for 5370
, it would be 2007-03-24 - 2007-03-31
, 2007-03-31 - 2007-04-07
, and so on. If I want to make month periods for 5370
, it would be 2007-03-24 - 2007-04-24
, 2007-04-24 - 2007-05-24
, and so on.
For another account, periods would be different. So, how can I achieve this? How can I make specific periods for each account separately starting from their first transaction date?
The following is all in Base-R. First, the custom weeks. This uses mod 7 on the difference between the first entry and the rest. Hopefully someone can use this approach and make you a more efficient data.table
solution.
df$Week <- unsplit(tapply(df$date_trx,df$from_id, function(x) as.numeric((x-x[1])) %/% 7 ),df$from_id)
Then if you want to take say the mean by ID by Week you can
aggregate(amount ~ from_id + Week, df, mean)
from_id Week amount
1 5370 0 1000.0
2 6143 0 4276.0
3 6213 0 11832.0
4 6254 0 200.0
5 6374 0 1000.0
6 6669 0 200.0
7 6934 0 49752.0
8 7468 0 700.0
9 7517 0 1000.0
10 9240 0 19961.0
11 5370 2 3679.0
12 6143 2 13.4
13 5370 4 12600.0
14 5370 5 19.2
15 5370 7 3679.0
16 5370 8 2100.0
17 5370 9 1879.0
18 5370 12 200.0
19 5370 13 100.6
Of course you can replace mean
with max
or any other function.