
Is there a faster way of creating a new vector by comparing 4 other vectors through rows i and i-1, in R?


Imagine you have a dataset of customers with their purchase history.

The data is ordered by customer and by the date of each activity (i.e. purchase).

The goal is to calculate the frequency of their purchases, and to do so quickly.

library(dplyr)   # tibble(), mutate()
library(tictoc)  # tic(), toc()

Data <- tibble(Customer = c("Person A", "Person A", "Person A", "Person A", "Person A", "Person A","Person B", "Person C","Person C"),
           First_Activity_Date = c(1,1,1,1,1,1,1,1,1),   # imagine these numbers as dates
           Activity_Date = c(1,2,3,4,5,6,1,1,2),         
           Last_Activity_Date =c(6,6,6,6,6,6,1,2,2)
           )

View(Data)

tic()
h <- vector( "integer", length = 9)
f <- function(x, y, z, q){
     for( i in 1:length(x)){
         if ( identical(z[i],y[i])) { h[i] <- 1 }
         else if ( identical(x[i],x[i-1]) && (z[i]<=q[i])) { h[i] <- (h[i-1]+1) }
       }
     return(h)
     }

Data <- mutate(Data, Frequency = f(Customer, First_Activity_Date, 
Activity_Date, Last_Activity_Date) )

View(Data)
toc()



#Data <- select( Data, Customer, First_Activity_Date, Activity_Date, Last_Activity_Date) 
#remove(h)
#remove(f) 

It works fine with a small dataset filled with numbers, but with more than 50K rows filled with dates it takes about 2 minutes.

Is there a way to vectorize this function/calculation?


Solution

  • Let's build up an alternative solution

    f1 <- function(x, y, z, q) {
    

    Allocate the result vector inside the function, using arguments passed to the function

        h <- integer(length(x)) # allocate the result inside the function
    

    Your loop consists of parts that can be 'vectorized' (one function call, rather than a function call for each iteration of the loop). Write the vectorized versions

        tst_1 <- z == y        # 'hoist' outside loop as vectorized comparison
        h[tst_1] <- 1L         # update h; '1L': integer, not '1': numeric
    

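    The else part of the conditional has a bug when i == 1, because one tries to compare x[1] to the non-existent x[0]. In R, x[0] is a zero-length vector, so identical() returns FALSE and the branch is silently skipped rather than raising an error. Let's suppose that we never enter the conditional for i == 1, so the vectorized version is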

        tst_2 <- !tst_1 & c(FALSE, tail(x, -1) == head(x, -1)) & z <= q
    

    The most straightforward way to implement the update of h is a simple loop like

        for (i in which(tst_2))
            h[i] <- h[i - 1] + 1L
    

    and finally return the result

        h
    }
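The shifted comparison behind tst_2 can be checked in isolation. This is a minimal sketch on a small made-up vector of customer labels (the values are hypothetical, not taken from the question's data):

```r
# Hypothetical customer labels, ordered so each customer's rows are adjacent
x <- c("A", "A", "B", "B", "B", "C")

# Compare every element with its predecessor. The first element has no
# predecessor, so the result is padded with a leading FALSE.
same_as_prev <- c(FALSE, tail(x, -1) == head(x, -1))
print(same_as_prev)
```

Each TRUE marks a row that belongs to the same customer as the row before it, which is the vectorized counterpart of identical(x[i], x[i-1]) in the original loop.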
    

    The full function, tweaked slightly, is

    f1 <- function(x, y, z, q) {
        h <- integer(length(x)) # allocate the result inside the function
        ## if (...)
        h[z == y] <- 1L
        ## else if (...)
        tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
        for (i in which(tst))
            h[i] <- h[i - 1] + 1L
        h
    }
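As a quick check, f1 can be run on the question's example columns. The sketch below uses plain vectors instead of the tibble so that it needs no packages; the variable names Customer, First, Activity and Last are shorthand chosen here for the four columns.

```r
f1 <- function(x, y, z, q) {
    h <- integer(length(x))
    h[z == y] <- 1L
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))
        h[i] <- h[i - 1] + 1L
    h
}

# The question's example data as plain vectors
Customer <- c("Person A", "Person A", "Person A", "Person A", "Person A",
              "Person A", "Person B", "Person C", "Person C")
First    <- c(1, 1, 1, 1, 1, 1, 1, 1, 1)
Activity <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)
Last     <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)

res <- f1(Customer, First, Activity, Last)
print(res)
```

For this data the result is 1 2 3 4 5 6 1 1 2: a running count that restarts at 1 for each customer.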
    

    Performance can be improved further by focusing on the remaining for() loop, but perhaps this already gets you to the performance needed, without being too cryptic?
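One possible way to remove the remaining loop is sketched below. It is an illustration, not the only option: the name f2 and the helpers start and run are made up here, and the code assumes the inputs contain no NA values. The idea is that every position where tst is FALSE starts a new run, and each following position where tst is TRUE adds one to the value at the start of its run.

```r
# Loop version from above, kept for comparison
f1 <- function(x, y, z, q) {
    h <- integer(length(x))
    h[z == y] <- 1L
    tst <- !h & c(FALSE, x[-1] == x[-length(x)]) & z <= q
    for (i in which(tst))
        h[i] <- h[i - 1] + 1L
    h
}

# Loop-free sketch (hypothetical name f2); assumes no NA in the inputs
f2 <- function(x, y, z, q) {
    n <- length(x)
    h <- integer(n)
    h[z == y] <- 1L
    tst <- !h & c(FALSE, x[-1] == x[-n]) & z <= q
    start <- which(!tst)   # positions that begin a run (tst[1] is always FALSE)
    run <- cumsum(!tst)    # index into 'start' for every position
    # value at the start of the run plus the distance from that start
    h[start][run] + seq_len(n) - start[run]
}

x <- c("Person A", "Person A", "Person A", "Person A", "Person A",
       "Person A", "Person B", "Person C", "Person C")
y <- c(1, 1, 1, 1, 1, 1, 1, 1, 1)
z <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)
q <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)

same <- identical(f1(x, y, z, q), f2(x, y, z, q))
print(same)
```

Whether this is worth the loss of readability depends on how many rows satisfy tst; it is worth timing both versions on the real data before switching.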

    One could also more cleanly separate the 'filter' operation of selecting relevant events

    keep <- (z >= y) & (z <= q)
    x0 <- x[keep]
    

    from the process of operating on each group. Here you're creating a group-wise sequence from 1 to the number of members of the group. Several approaches are

    h0 <- ave(seq_along(x0), x0, FUN=seq_along)
    

    or

    grp_size <- rle(x0)$lengths   # run lengths; assumes each customer's rows are contiguous
    offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
    h0 <- seq_len(sum(grp_size)) - offset
    

    or

    grp_size <- tabulate(match(x0, unique(x0)))   # group sizes in order of first appearance
    offset <- rep(cumsum(c(0L, grp_size[-length(grp_size)])), grp_size)
    h0 <- seq_len(sum(grp_size)) - offset
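The three approaches can be compared side by side. This sketch uses a made-up x0 (hypothetical labels, already ordered by group) and suffixed variable names chosen here so the results do not overwrite each other:

```r
# Hypothetical filtered customer labels, ordered by customer
x0 <- c("A", "A", "A", "B", "C", "C")

# 1. ave(): apply seq_along() within each group
h_ave <- ave(seq_along(x0), x0, FUN = seq_along)

# 2. rle(): run lengths, then subtract each group's starting offset
size_rle <- rle(x0)$lengths
off_rle <- rep(cumsum(c(0L, size_rle[-length(size_rle)])), size_rle)
h_rle <- seq_len(sum(size_rle)) - off_rle

# 3. tabulate(): group sizes via match(), then the same offset trick
size_tab <- tabulate(match(x0, unique(x0)))
off_tab <- rep(cumsum(c(0L, size_tab[-length(size_tab)])), size_tab)
h_tab <- seq_len(sum(size_tab)) - off_tab

print(rbind(h_ave, h_rle, h_tab))
```

All three give 1 2 3 1 1 2 here. The offset-based versions rely on each group's rows being adjacent, which holds for data ordered by customer; ave() does not need that.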
    

    Other solutions to this problem can be found elsewhere on Stack Overflow. The final step is to create the return value

    h <- integer(length(x))
    h[keep] <- h0
    h
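Put together, the filter, the group-wise sequence and the final assignment might look like the sketch below. The function name f3 is made up here, the filter is written as "activity date between first and last date", and ave() is used for the grouping step, though either of the other variants would fit equally well.

```r
# Hypothetical name f3: filter rows, number them within each customer,
# and write the counts back into a full-length result
f3 <- function(x, y, z, q) {
    keep <- (z >= y) & (z <= q)   # activity date within [first, last]
    x0 <- x[keep]
    h0 <- ave(seq_along(x0), x0, FUN = seq_along)
    h <- integer(length(x))       # rows outside the range stay 0
    h[keep] <- h0
    h
}

x <- c("Person A", "Person A", "Person A", "Person A", "Person A",
       "Person A", "Person B", "Person C", "Person C")
y <- c(1, 1, 1, 1, 1, 1, 1, 1, 1)
z <- c(1, 2, 3, 4, 5, 6, 1, 1, 2)
q <- c(6, 6, 6, 6, 6, 6, 1, 2, 2)

res3 <- f3(x, y, z, q)
print(res3)
```

On the question's data this returns 1 2 3 4 5 6 1 1 2, the same as the loop. The two formulations are not guaranteed to agree in every edge case: for example, when a row in the middle of a customer's history falls outside the date range, this version keeps counting afterwards while the loop restarts from the preceding value.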
    

    Data is a tibble, so perhaps you're familiar with dplyr. One way of achieving the result in an intelligible but not necessarily efficient way is

    d0 <- Data %>%
        filter(
            Activity_Date >= First_Activity_Date, 
            Activity_Date <= Last_Activity_Date
        ) %>% 
        group_by(Customer) %>%
        mutate(Frequency = seq_along(Customer))
    left_join(Data, d0)