Search code examples
rdataframefor-loopindexingvectorization

Vectorize dataframe in a for loop


How would you vectorize this for loop in R? This is just a small example but i would need to do this for 11 columns.

for(i in 2:nrow(df1)){
  if(df1[i, 'sku2'] == ''){
    df1[i,'sku2'] <- df1[i - 1, 'sku2']
  }
}
return(df1)

Data

df1 <- data.frame(sku2 = rep(c(1:2, ""), times = 5),
                  a    = rep(c(3:4, ""), times = 5))

Solution

  • The task here, if I understand it, is to replace blanks in each column with the most recent non-blank of that column.

    Here's a vectorized tidyverse approach:

    library(tidyverse)
    df1 %>%
     mutate_all(na_if,"") %>%
     fill(names(df1), .direction = "down)
    

    This takes the df1 data frame, converts blanks to NAs, then uses tidyr::fill on every column to use the last non-blank value.

    I expect this will be much faster [edit: it is 100-1000x faster] than your loop, but if you want "the fastest possible" approach, it may be worth looking into approaches using the data.frame or collapse packages.

    Performance

    For an example with 100,000 rows and 3 columns, this approach was 100x faster: 0.156 seconds instead of 15.5 seconds.

    For 200k rows, the loop took 150x as much time: 31 seconds vs. 0.2 seconds.

    For 400k rows, the loop was 600x slower: 123 seconds vs. still 0.2 sec for the vectorized version.

    I ran out of patience for testing larger data with the loop, but the vectorized version is still only 0.4 seconds with 5 million rows...

    set.seed(42)
    n = 1E5
    df1 <- data.frame(sku2 = sample(c(1:4, ""), n, replace = TRUE),
                      sku3 = sample(c(1:8, ""), n, replace = TRUE),
                      sku4 = sample(c(1:20, ""), n, replace = TRUE))