r for-loop if-statement vector comparison

Is there a faster way of creating a new vector by comparing 4 other vectors through rows i and i-1, in R?

Imagine you have a dataset of customers with their purchase history.

The data is ordered by customer and then by the date of each activity (that is, each purchase).

The goal is to calculate the frequency of their purchasing, and to do so quickly.

library(tibble)   # tibble()
library(dplyr)    # mutate()
library(tictoc)   # tic(), toc()

Data <- tibble(Customer = c("Person A", "Person A", "Person A", "Person A", "Person A", "Person A", "Person B", "Person C", "Person C"),
               First_Activity_Date = c(1,1,1,1,1,1,1,1,1),   # imagine these numbers as dates
               Activity_Date = c(1,2,3,4,5,6,1,1,2),
               Last_Activity_Date = c(6,6,6,6,6,6,1,2,2)
               )

View(Data)

tic()
h <- vector( "integer", length = 9)
f <- function(x, y, z, q){
     for( i in 1:length(x)){
         if ( identical(z[i],y[i])) { h[i] <- 1 }
         else if ( identical(x[i],x[i-1]) && (z[i]<=q[i])) { h[i] <- (h[i-1]+1) }
       }
     return(h)
     }

Data <- mutate(Data, Frequency = f(Customer, First_Activity_Date, 
Activity_Date, Last_Activity_Date) )

View(Data)
toc()



#Data <- select( Data, Customer, First_Activity_Date, Activity_Date, Last_Activity_Date) 
#remove(h)
#remove(f)

It works fine on a small dataset filled with numbers, but with more than 50K rows filled with dates it takes about 2 minutes.

Is there a way to vectorize this function/calculation?

Solution

Let's build up an alternative solution

f1 <- function(x, y, z, q) {

Allocate the result vector inside the function, using arguments passed to the function

    h <- integer(length(x)) # allocate the result inside the function

Your loop consists of parts that can be 'vectorized' (one function call, rather than a function call for each iteration of the loop). Write the vectorized versions

    tst_1 <- z == y        # 'hoist' outside loop as vectorized comparison
    h[tst_1] <- 1L         # update h; '1L': integer, not '1': numeric

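The else part of the conditional has a latent bug when i == 1, because it compares x[1] to the non-existent x[0]. In R, x[0] is a zero-length vector, so identical() returns FALSE and the branch is silently skipped rather than raising an error. Let's suppose that we never want to enter the conditional for i == 1, so the vectorized version is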

    tst_2 <- !tst_1 & c(FALSE, tail(x, -1) == head(x, -1)) & z <= q

The most straightforward way to implement the update of h is a simple loop over only the rows that need it, like

    for (i in which(tst_2))
        h[i] <- h[i - 1] + 1L

and finally return the result

    h
}
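The i == 1 edge case and the lagged comparison can be checked in isolation. The following is a minimal sketch on a made-up three-element vector; the values in x and the names first_vs_zero and same_as_prev are assumptions for illustration, not part of the question's data.

```r
x <- c("A", "A", "B")   # hypothetical customer column, already sorted

# Indexing with 0 yields a zero-length vector, so the original test is
# FALSE at i == 1 instead of raising an error.
first_vs_zero <- identical(x[1], x[0])

# Compare each element with its predecessor. The first element has no
# predecessor, so the result is padded with a leading FALSE.
same_as_prev <- c(FALSE, tail(x, -1) == head(x, -1))

print(first_vs_zero)   # FALSE
print(same_as_prev)    # FALSE  TRUE FALSE
```

The leading FALSE is what keeps the first row out of the "else if" branch in the vectorized test.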

The full function, tweaked slightly, is

f1 <- function(x, y, z, q) {
    h <- integer(length(x)) # allocate the result inside the function
    ## if (...)
    h[z == y] <- 1L
    ## else if (...)
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))
        h[i] <- h[i - 1] + 1L
    h
}
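As a quick sanity check, f1() can be run on the columns from the question's example, passed as plain vectors so that no packages are needed. This is only a sketch; the variable freq is a name introduced here.

```r
# f1() as given in the answer, repeated so the example is self-contained
f1 <- function(x, y, z, q) {
    h <- integer(length(x))
    h[z == y] <- 1L
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))
        h[i] <- h[i - 1] + 1L
    h
}

# Columns from the question's example data
Customer            <- c(rep("Person A", 6), "Person B", "Person C", "Person C")
First_Activity_Date <- rep(1, 9)
Activity_Date       <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)
Last_Activity_Date  <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)

freq <- f1(Customer, First_Activity_Date, Activity_Date, Last_Activity_Date)
print(freq)   # 1 2 3 4 5 6 1 1 2
```

The result matches the Frequency column that the original loop produces for this data.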

Performance can be improved further by focusing on the remaining for() loop, but perhaps this already gets you to the performance needed, without being too cryptic?
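One possible way to remove the remaining loop is sketched below. It relies on the observation that every row not selected by the "else if" test restarts the count (at 1 or 0), and every selected row adds its distance from the most recent restart. The name f2 and its internal variable names are hypothetical, and the sketch assumes there are no NA values in any of the four vectors.

```r
f2 <- function(x, y, z, q) {
    n <- length(x)
    if (n == 0L) return(integer(0))
    idx <- seq_len(n)

    first <- z == y                                  # the 'if' branch
    cont  <- !first & c(FALSE, x[-1] == x[-n]) & z <= q  # the 'else if' branch

    # For each row, the index of the most recent row where the count restarts.
    # Row 1 is never a continuation, so 'start' is always at least 1.
    start <- cummax(ifelse(cont, 0L, idx))

    # Value at the restart row (1 or 0) plus the distance from it
    as.integer(first)[start] + (idx - start)
}

res <- f2(c(rep("Person A", 6), "Person B", "Person C", "Person C"),
          rep(1, 9),
          c(1, 2, 3, 4, 5, 6, 1, 1, 2),
          c(6, 6, 6, 6, 6, 6, 1, 2, 2))
print(res)   # 1 2 3 4 5 6 1 1 2
```

Whether this is actually faster than the short loop on real data would need to be measured; it is arguably more cryptic, which is the trade-off mentioned above.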

One could also more cleanly separate the 'filter' operation of selecting relevant events

keep <- (z >= y) & (z <= q)   # activity date between first and last activity date
x0 <- x[keep]

from the process of operating on each group. Here you're creating a group-wise sequence from 1 to the number of members of the group. Several approaches are

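## 1. ave(): split by group and apply seq_along() within each group
h0 <- ave(seq_along(x0), x0, FUN=seq_along)

## 2. rle(): run lengths; requires each customer's rows to be contiguous
grp_size <- rle(x0)$lengths
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h0 <- seq_len(sum(grp_size)) - offset

## 3. tabulate(): group sizes in order of first appearance; same contiguity requirement
grp_size <- tabulate(match(x0, unique(x0)))
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h0 <- seq_len(sum(grp_size)) - offset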
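To see that the three approaches agree, they can be run side by side on a small sorted vector. This is a sketch on made-up data; the vector x0 and the names h_ave, h_rle and h_tab are assumptions introduced for the comparison.

```r
x0 <- c("A", "A", "A", "B", "C", "C")   # hypothetical, sorted by customer

# Approach 1: ave()
h_ave <- ave(seq_along(x0), x0, FUN = seq_along)

# Approach 2: rle() run lengths, then subtract each group's starting offset
grp_size <- rle(x0)$lengths
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h_rle <- seq_len(sum(grp_size)) - offset

# Approach 3: tabulate() group sizes, same offset arithmetic
grp_size <- tabulate(match(x0, unique(x0)))
offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
h_tab <- seq_len(sum(grp_size)) - offset

print(h_ave)   # 1 2 3 1 1 2
print(all(h_ave == h_rle) && all(h_rle == h_tab))   # TRUE
```

The offset-based variants assume each customer's rows form one contiguous block, which holds here because the data is ordered by customer; ave() does not need that ordering.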

Other solutions to this problem can be found elsewhere on Stack Overflow. The final step is to create the return value

h <- integer(length(x))
h[keep] <- h0
h
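Put together, the filter, group-wise sequence, and final assignment might look like the sketch below. The function name f3 is hypothetical, and the filter is read as "activity date between first and last activity date", which is an assumption about the intent. Note that this is not guaranteed to match the original loop in every case: the loop starts a customer at 0 when the first row is not the first activity date, whereas this version always starts kept rows at 1.

```r
f3 <- function(x, y, z, q) {
    # Filter: rows whose activity date lies in [first, last]
    keep <- (z >= y) & (z <= q)

    h <- integer(length(x))          # rows outside the window stay 0
    if (!any(keep)) return(h)

    # Group-wise sequence 1, 2, ... within each customer, kept rows only
    x0 <- x[keep]
    h[keep] <- ave(seq_along(x0), x0, FUN = seq_along)
    h
}

res <- f3(c(rep("Person A", 6), "Person B", "Person C", "Person C"),
          rep(1, 9),
          c(1, 2, 3, 4, 5, 6, 1, 1, 2),
          c(6, 6, 6, 6, 6, 6, 1, 2, 2))
print(res)   # 1 2 3 4 5 6 1 1 2
```

Any of the other group-wise sequence approaches could be substituted for ave() here, provided the rows are ordered by customer.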

Data is a tibble, so perhaps you're familiar with dplyr. One way of achieving the result in an intelligible but not necessarily efficient way is

d0 <- Data %>%
    filter(
        Activity_Date >= First_Activity_Date, 
        Activity_Date <= Last_Activity_Date
    ) %>% 
    group_by(Customer) %>%
    mutate(Frequency = seq_along(Customer))
## Joins on all shared columns; rows removed by filter() get Frequency = NA, not 0
left_join(Data, d0)