
Filling missing rows for time series data in R


Thanks to many great Stack Overflow posts, I have a solution for filling in missing rows in time series data. My main concern is whether there is a more concise way to do it. I am working with data like the following:

df <- data.frame(
        id = c("A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C"),
        week = c(-13, -2, 4, 5, 6, 3, 4, 5, -8, -5, 3), 
        last_week = c(6, 6, 6, 6, 6, 5, 5, 5, 3, 3, 3),
        first_week = c(-20, -20, -20, -20, -20, 2, 2, 2, -3, -3, -3),
        dv = c(3, 2, 2, 1, 4, 5, 2, 3, 1, 1, 2)
      )

My goal is threefold:

1) If first_week is smaller than -10, I should have a row for every week from -10 to last_week, i.e., id A should have rows for weeks -10 to 6.

2) If first_week is larger than 0, I should have a row for every week from 1 to last_week, i.e., id B should have rows for weeks 1 to 5.

3) In all other cases, I should have a row for every week from first_week to last_week, i.e., id C should have rows for weeks -3 to 3.
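
The three rules above amount to choosing a start week per id and then counting the weeks up to last_week. A minimal base R sketch of just that rule follows; the helper name start_week and the small ids table are made up for illustration and are not part of the original code.

```r
# Hypothetical helper: map first_week to the first week that should appear.
start_week <- function(first_week) {
  ifelse(first_week < -10, -10,          # rule 1: clamp to -10
         ifelse(first_week > 0, 1,       # rule 2: start at week 1
                first_week))             # rule 3: keep first_week as is
}

# One row per id, using the first_week / last_week values from the example data.
ids <- data.frame(
  id         = c("A", "B", "C"),
  first_week = c(-20, 2, -3),
  last_week  = c(6, 5, 3)
)

ids$start  <- start_week(ids$first_week)
n_rows     <- ids$last_week - ids$start + 1   # rows expected per id
print(ids)
print(n_rows)
```

With the example data this gives start weeks of -10, 1 and -3, so the filled result should have 17, 5 and 7 rows for ids A, B and C.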

Right now, my solution is like the following:

library(dplyr)  # provides filter(), mutate() and %>%

loop_for_filling <- function(df){
    for(i in unique(df$id)){
      current_id_df <- filter(df, id == i)
      current_id_last_week <- unique(current_id_df$last_week)
      current_id_first_week <- unique(current_id_df$first_week)

      # Create a sequence of weeks to be filled
      if(current_id_first_week > 0){
        all_weeks = seq(1, current_id_last_week)
      } else if(current_id_first_week < -10){
          all_weeks = seq(-10, current_id_last_week)
      } else{
            all_weeks = seq(current_id_first_week, current_id_last_week)
            current_id_df = filter(current_id_df, week >= first_week)
      }

      # Create a dataframe with rows for every week btwn last_week and first_week
      current_id_all <- data.frame(list(week = all_weeks)) %>% mutate(id = i)

      # Merge two dataframes
      current_id_new_df <- merge(current_id_df, current_id_all, all = T) %>% 
        subset(., select = -c(last_week, first_week)) %>% 
        filter(week >= -10)

      # Append this id's rows to the accumulated result
      if(i == unique(df$id)[[1]]){
        all_file <- current_id_new_df
      } else {
        all_file <- rbind(all_file, current_id_new_df)
      }
    }

    all_file

  }

  df2 <- loop_for_filling(df)
  df2

This certainly works, but I'm working with a large dataset (50k ids), and I was wondering whether there is a shorter, more concise way to handle this so I don't need to stare at my loop for three hours :)

Thank you!


Solution

  • I presume this will run much faster. First I establish the range of weeks that should be shown for each id, applying the specified adjustments. Then I use tidyr::uncount() to make rows for each of the needed id-week combinations. Finally, I join to the original data.

    library(tidyverse)
    df_ranges <- df %>%
      distinct(id, first_week, last_week) %>% 
      mutate(first_week = case_when(first_week < -10 ~ -10,
                                    first_week > 0   ~   1,
                                    TRUE             ~ first_week)) %>%
      mutate(week_count = last_week - first_week + 1)
    
    df2b <- df_ranges %>%
      uncount(week_count, .id = "week") %>%
      mutate(week = first_week + week - 1) %>%
      select(id, week) %>%
      left_join(df %>% select(id, week, dv), by = c("id", "week"))
    
    identical(df2b, df2)
    #[1] TRUE
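
To see what the uncount() step is doing, the same expansion can be sketched in base R with rep() and sequence(). This is only an illustration on a toy table (the ranges for ids B and C after the start-week adjustment), not a replacement for the code above; the names ranges, row_idx and expanded are made up for this sketch.

```r
# Toy ranges: ids B and C from the question, after adjusting first_week.
ranges <- data.frame(
  id         = c("B", "C"),
  first_week = c(1, -3),
  last_week  = c(5, 3)
)
ranges$week_count <- ranges$last_week - ranges$first_week + 1

# Repeat each row week_count times, which is what uncount() does.
row_idx  <- rep(seq_len(nrow(ranges)), times = ranges$week_count)
expanded <- ranges[row_idx, c("id", "first_week")]

# sequence() counts 1..n within each repeated row, like uncount()'s .id column.
expanded$week <- expanded$first_week + sequence(ranges$week_count) - 1

expanded <- expanded[, c("id", "week")]
rownames(expanded) <- NULL
print(expanded)
```

Each id ends up with one row per week in its range; the left join then attaches dv where a matching id-week exists in the original data and leaves NA elsewhere.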