Search code examples
rperformancevectorization

How can I improve my R code especially performance?


I need to improve my code because it is way too slow. Maybe you can help me?

Starting situation: I have a dataframe with two accounts (account1 and account2). I know for both accounts the value for specific days. For example: account1 has on day 1 the value 1000.

Task: There are 150 possible days for each account. I have the value for specific days for each account. Now I want to create a dataframe with every possible day for each account and the associated value. The associated value for each possible day is the last known/obtained value out of the original dataframe.

Example: account1 has on day 1 the value 1000 and on day 10 the value 5000. Therefore days 1 to 9 have the value 1000. Following this for account1 days 10 to 34 have the value 5000. Days 35 to 69 have the value 10000. Days 70 to 89 have the value 8000 and finally days 90 to 150 have the value 7000 for account1.

My code does what I want but is way too slow. Do you have some helping tipps?

library(tidyverse)

# The dataframe 
df <- data.frame(identifier = c("account1", "account1", "account1", "account1", "account1", "account2", "account2", "account2", "account2", "account2"), 
                 sample_day = c(1, 10, 35, 70, 90, 1, 40, 80, 100, 135), 
                 value = c(1000, 5000, 10000, 8000, 7000, 3000, 6000, 9000, 8000, 1000))

# This is where I save my results 
df_result <- data.frame()

# Within the for loop I go through every identifier 
for (id in unique(df$id)) {   
   # First I filter my data for the selected identifier
   df_temp <- df %>% filter(identifier == id)   
   # Every identifier has 150 days and I wanto to know the value for each day   
   all_days <- c(1:150)   
   # Here I save my temporary results   
   df_days_value <- data.frame()   
   # I want to figure out the value for each day within the 150 days. 
   # The correct value each day is the last known value from the original dataframe "df"   
   for (day in all_days) {
      help_df <- df_temp %>% filter(sample_day <= day)
      df_days_value_temp <- help_df %>% filter(sample_day == max(sample_day))
      df_days_value <- rbind(df_days_value, df_days_value_temp)
    }
    df_days_value$all_days <- all_days
    df_result <- rbind(df_result, df_days_value)
}

Solution

  • There are helper functions in the tidyr package that make this very easy: complete() to get all combinations of days and identifiers, and fill to fill in the missing values by the last observation:

    library(tidyr)
    library(dplyr)
    result = df %>% 
      complete(identifier, sample_day = 1:150) %>%
      group_by(identifier) %>%
      fill(value) %>%
      ungroup()
    print(result, n = 20)
    # # A tibble: 300 × 3
    #    identifier sample_day value
    #    <chr>           <dbl> <dbl>
    #  1 account1            1  1000
    #  2 account1            2  1000
    #  3 account1            3  1000
    #  4 account1            4  1000
    #  5 account1            5  1000
    #  6 account1            6  1000
    #  7 account1            7  1000
    #  8 account1            8  1000
    #  9 account1            9  1000
    # 10 account1           10  5000
    # 11 account1           11  5000
    # 12 account1           12  5000
    # 13 account1           13  5000
    # 14 account1           14  5000
    # 15 account1           15  5000
    # 16 account1           16  5000
    # 17 account1           17  5000
    # 18 account1           18  5000
    # 19 account1           19  5000
    # 20 account1           20  5000
    # # … with 280 more rows
    # # ℹ Use `print(n = ...)` to see more rows