Imagine you have a dataset of customers with their purchase history.
The data is ordered by customer and by the date of their activity (i.e., purchase).
The goal is to calculate the frequency of their purchasing, and to do so quickly.
library(dplyr)   # tibble(), mutate()
library(tictoc)  # tic(), toc()
Data <- tibble(Customer = c("Person A", "Person A", "Person A", "Person A", "Person A", "Person A","Person B", "Person C","Person C"),
First_Activity_Date = c(1,1,1,1,1,1,1,1,1), # imagine these numbers as dates
Activity_Date = c(1,2,3,4,5,6,1,1,2),
Last_Activity_Date =c(6,6,6,6,6,6,1,2,2)
)
View(Data)
tic()
h <- vector( "integer", length = 9)
f <- function(x, y, z, q){
for( i in 1:length(x)){
if ( identical(z[i],y[i])) { h[i] <- 1 }
else if ( identical(x[i],x[i-1]) && (z[i]<=q[i])) { h[i] <- (h[i-1]+1) }
}
return(h)
}
Data <- mutate(Data, Frequency = f(Customer, First_Activity_Date,
Activity_Date, Last_Activity_Date) )
View(Data)
toc()
#Data <- select( Data, Customer, First_Activity_Date, Activity_Date, Last_Activity_Date)
#remove(h)
#remove(f)
It works fine with a small dataset filled with numbers, but with more than 50K rows filled with dates it takes about 2 minutes.
Is there a way to vectorize this function/calculation?
Let's build up an alternative solution, one step at a time.
f1 <- function(x, y, z, q) {
Allocate the result vector inside the function, using arguments passed to the function
h <- integer(length(x)) # allocate the result inside the function
Your loop consists of parts that can be 'vectorized' (one function call, rather than a function call for each iteration of the loop). Write the vectorized versions
tst_1 <- z == y # 'hoist' outside loop as vectorized comparison
h[tst_1] <- 1L # update h; '1L': integer, not '1': numeric
The else part of the conditional has a bug when i == 1, because one tries to compare x[1] to the non-existent x[0]. Let's suppose that we never enter the conditional for i == 1, so the vectorized version is
tst_2 <- !tst_1 & c(FALSE, tail(x, -1) == head(x, -1)) & z <= q
The most straightforward way to implement the update of h is a simple loop like
for (i in which(tst_2))
h[i] <- h[i - 1] + 1L
and finally return the result
h
}
The full function, tweaked slightly, is
f1 <- function(x, y, z, q) {
h <- integer(length(x)) # allocate the result inside the function
## if (...)
h[z == y] <- 1L
## else if (...)
tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
for (i in which(tst))
h[i] <- h[i - 1] + 1L
h
}
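As a quick sanity check, here is a minimal sketch that runs f1() on the sample columns from the question, passed as plain vectors so that no packages are needed. The names x, y, z, q and res are only illustrative.

```r
# Sample columns from the question as plain vectors
x <- c(rep("Person A", 6), "Person B", "Person C", "Person C")  # Customer
y <- rep(1, 9)                                                  # First_Activity_Date
z <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)                               # Activity_Date
q <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)                               # Last_Activity_Date

f1 <- function(x, y, z, q) {
    h <- integer(length(x))
    h[z == y] <- 1L                      # first activity of a customer
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))                # only rows that continue a run
        h[i] <- h[i - 1] + 1L
    h
}

res <- f1(x, y, z, q)
res
# [1] 1 2 3 4 5 6 1 1 2
```

The result matches what the original loop produces for this data.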
Performance can be improved further by focusing on the remaining for() loop, but perhaps this already gets you to the performance needed, without being too cryptic?
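To judge whether this is fast enough for your case, one could time f1() on synthetic data of roughly the size mentioned in the question. The sketch below is only an assumption about what such data might look like (integer 'dates', 10,000 made-up customers with 1 to 10 activities each); timings will differ with real Date columns and on your machine.

```r
f1 <- function(x, y, z, q) {
    h <- integer(length(x))
    h[z == y] <- 1L
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))
        h[i] <- h[i - 1] + 1L
    h
}

# Hypothetical synthetic data: every activity falls inside the customer's range
set.seed(1)
n_cust <- 10000
sizes <- sample(1:10, n_cust, replace = TRUE)      # activities per customer
x <- rep(paste("Person", seq_len(n_cust)), sizes)  # Customer, already sorted
z <- sequence(sizes)                               # Activity_Date: 1..size per customer
y <- rep(1L, length(x))                            # First_Activity_Date
q <- rep(sizes, sizes)                             # Last_Activity_Date

timing <- system.time(res <- f1(x, y, z, q))
timing
```

Because every synthetic activity is in range, res should simply reproduce the within-customer counter z, which gives an easy correctness check alongside the timing.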
One could also more cleanly separate the 'filter' operation of selecting relevant events
keep <- (z >= y) & (z <= q) # activity between first and last activity date
x0 <- x[keep]
from the process of operating on each group. Here you're creating a group-wise sequence from 1 to the number of members of the group. Several approaches are
h0 <- ave(seq_along(x0), x0, FUN=seq_along)
or
grp_size = rle(x0)$lengths
offset = rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h0 <- seq_len(sum(grp_size)) - offset
or
grp_size = tabulate(match(x0, unique(x0)))
offset = rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h0 <- seq_len(sum(grp_size)) - offset
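To make the three variants concrete, here is a small sketch on a made-up, already-sorted group vector; the names x0, h_ave, h_rle and h_tab are illustrative only. Note that the rle() and tabulate() variants rely on the members of each group being contiguous, which holds here because the data is ordered by customer.

```r
x0 <- c("A", "A", "A", "B", "C", "C")   # hypothetical sorted group labels

# 1. ave(): apply seq_along() within each group
h_ave <- ave(seq_along(x0), x0, FUN = seq_along)

# 2. rle(): run lengths give the group sizes directly
grp_size <- rle(x0)$lengths
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h_rle <- seq_len(sum(grp_size)) - offset

# 3. tabulate(): count members per group, in order of first appearance
grp_size <- tabulate(match(x0, unique(x0)))
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h_tab <- seq_len(sum(grp_size)) - offset

h_rle
# [1] 1 2 3 1 1 2
```

All three produce the same within-group counter for this input; the offset subtracted from the overall row index is the number of rows in all preceding groups.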
Other solutions to this problem are found elsewhere on StackOverflow. The final step is to create the return value
h <- integer(length(x))
h[keep] <- h0
h
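Putting the filter, the group-wise sequence, and the final assignment together gives something like the sketch below. The name f2 is hypothetical, and the filter assumes the intent is to keep activities that fall between the first and last activity dates (as in the dplyr version of this answer) and that rows are sorted by customer and date.

```r
f2 <- function(x, y, z, q) {
    keep <- (z >= y) & (z <= q)          # assumed filter: first <= activity <= last
    x0 <- x[keep]
    h0 <- ave(seq_along(x0), x0, FUN = seq_along)  # counter within each customer
    h <- integer(length(x))              # rows that are filtered out stay 0
    h[keep] <- h0
    h
}

# Sample columns from the question as plain vectors
x <- c(rep("Person A", 6), "Person B", "Person C", "Person C")
y <- rep(1, 9)
z <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)
q <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)

res <- f2(x, y, z, q)
res
# [1] 1 2 3 4 5 6 1 1 2
```

This version has no explicit for() loop; the per-group work is delegated to ave(), which could be swapped for either of the offset-based variants.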
Data is a tibble, so perhaps you're familiar with dplyr. One way of achieving the result in an intelligible but not necessarily efficient way is
d0 <- Data %>%
filter(
Activity_Date >= First_Activity_Date,
Activity_Date <= Last_Activity_Date
) %>%
group_by(Customer) %>%
mutate(Frequency = seq_along(Customer))
left_join(Data, d0)
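Note that with the left_join() approach, rows removed by the filter end up with NA rather than 0 in Frequency. If 0 is preferred, one possible variation (a sketch, not necessarily faster) avoids the join by counting in-range rows within each customer; the helper column in_range and the name res are made up for illustration.

```r
library(dplyr)

Data <- tibble(
    Customer = c(rep("Person A", 6), "Person B", "Person C", "Person C"),
    First_Activity_Date = c(1, 1, 1, 1, 1, 1, 1, 1, 1),
    Activity_Date = c(1, 2, 3, 4, 5, 6, 1, 1, 2),
    Last_Activity_Date = c(6, 6, 6, 6, 6, 6, 1, 2, 2)
)

res <- Data %>%
    group_by(Customer) %>%
    mutate(
        # TRUE when the activity lies between the first and last dates
        in_range = Activity_Date >= First_Activity_Date &
            Activity_Date <= Last_Activity_Date,
        # running count of in-range rows; out-of-range rows get 0
        Frequency = if_else(in_range, cumsum(in_range), 0L)
    ) %>%
    ungroup() %>%
    select(-in_range)

res$Frequency
# [1] 1 2 3 4 5 6 1 1 2
```

Like the other approaches, this assumes the rows are already ordered by customer and activity date.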