I need to improve my code because it is way too slow. Maybe you can help me?
Starting situation: I have a dataframe with two accounts (account1 and account2). I know for both accounts the value for specific days. For example: account1 has on day 1 the value 1000.
Task: There are 150 possible days for each account. I have the value for specific days for each account. Now I want to create a dataframe with every possible day for each account and the associated value. The associated value for each possible day is the last known/obtained value out of the original dataframe.
Example: account1 has on day 1 the value 1000 and on day 10 the value 5000. Therefore days 1 to 9 have the value 1000. Following this for account1 days 10 to 34 have the value 5000. Days 35 to 69 have the value 10000. Days 70 to 89 have the value 8000 and finally days 90 to 150 have the value 7000 for account1.
My code does what I want but is way too slow. Do you have some helping tipps?
library(tidyverse)
# The dataframe
df <- data.frame(identifier = c("account1", "account1", "account1", "account1", "account1", "account2", "account2", "account2", "account2", "account2"),
sample_day = c(1, 10, 35, 70, 90, 1, 40, 80, 100, 135),
value = c(1000, 5000, 10000, 8000, 7000, 3000, 6000, 9000, 8000, 1000))
# This is where I save my results
df_result <- data.frame()
# Within the for loop I go through every identifier
for (id in unique(df$id)) {
# First I filter my data for the selected identifier
df_temp <- df %>% filter(identifier == id)
# Every identifier has 150 days and I wanto to know the value for each day
all_days <- c(1:150)
# Here I save my temporary results
df_days_value <- data.frame()
# I want to figure out the value for each day within the 150 days.
# The correct value each day is the last known value from the original dataframe "df"
for (day in all_days) {
help_df <- df_temp %>% filter(sample_day <= day)
df_days_value_temp <- help_df %>% filter(sample_day == max(sample_day))
df_days_value <- rbind(df_days_value, df_days_value_temp)
}
df_days_value$all_days <- all_days
df_result <- rbind(df_result, df_days_value)
}
There are helper functions in the tidyr
package that make this very easy: complete()
to get all combinations of days and identifiers, and fill
to fill in the missing values by the last observation:
library(tidyr)
library(dplyr)
result = df %>%
complete(identifier, sample_day = 1:150) %>%
group_by(identifier) %>%
fill(value) %>%
ungroup()
print(result, n = 20)
# # A tibble: 300 × 3
# identifier sample_day value
# <chr> <dbl> <dbl>
# 1 account1 1 1000
# 2 account1 2 1000
# 3 account1 3 1000
# 4 account1 4 1000
# 5 account1 5 1000
# 6 account1 6 1000
# 7 account1 7 1000
# 8 account1 8 1000
# 9 account1 9 1000
# 10 account1 10 5000
# 11 account1 11 5000
# 12 account1 12 5000
# 13 account1 13 5000
# 14 account1 14 5000
# 15 account1 15 5000
# 16 account1 16 5000
# 17 account1 17 5000
# 18 account1 18 5000
# 19 account1 19 5000
# 20 account1 20 5000
# # … with 280 more rows
# # ℹ Use `print(n = ...)` to see more rows