r, na, feature-engineering

How to complete cases by group


I want to calculate ratios of certain variables for every id.

For example, if I have 3 periods, 2 ids, and a certain measure, as in this df:

df <- data.frame(
          date = c(202001L, 202002L, 202003L, 202001L, 202002L, 202003L),
            id = c("a", "a", "a", "b", "b", "b"),
         value = c(23L, 43L, 123L, 56L, 23L, 13L))

I should be able to just do this:

library(dplyr)

# Period-over-period relative change of value within each id
df_mod <- df %>% 
    group_by(id) %>% 
    mutate(value_var_1 = (value - dplyr::lag(value, 1))/dplyr::lag(value, 1)) %>% 
    ungroup()

But if, for example, id "a" is missing date = 202002, the lag would be taken between 202001 and 202003, which is no longer a lag of 1 period, as in this example:

df <- data.frame(
  date = c(202001L, 202003L, 202001L, 202002L, 202003L),
  id = c("a", "a", "b", "b", "b"),
  value = c(23L, 123L, 56L, 23L, 13L)
)

df_mod <- df %>% 
  group_by(id) %>% 
  mutate(value_var_1 = (value - dplyr::lag(value, 1))/dplyr::lag(value, 1)) %>% 
  ungroup()
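
To make the problem concrete, here is a small sketch on the same data that also keeps the lagged date. The `prev_date` column and the `df_check` name are only added for illustration; they are not part of the calculation I actually need.

```r
library(dplyr)

# Same data as above: id "a" has no row for 202002
df <- data.frame(
  date = c(202001L, 202003L, 202001L, 202002L, 202003L),
  id = c("a", "a", "b", "b", "b"),
  value = c(23L, 123L, 56L, 23L, 13L)
)

df_check <- df %>%
  group_by(id) %>%
  mutate(
    # hypothetical helper column: the period the lagged value came from
    prev_date = dplyr::lag(date, 1),
    value_var_1 = (value - dplyr::lag(value, 1)) / dplyr::lag(value, 1)
  ) %>%
  ungroup()

# For id "a", the 202003 row is compared against 202001, not 202002
df_check %>% filter(id == "a")
```

So for id "a" the ratio at 202003 is silently computed over two periods instead of one.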

How can I make sure that id "a" has the 3 periods I need to make the calculations? How can I fill the missing dates with the last value?


Solution

  • You can use tidyr::complete to complete the missing combinations and then perform the calculations for each id:

    library(dplyr)
    
    df %>%
      tidyr::complete(id, date = unique(date)) %>%
      group_by(id) %>%
      mutate(value_var_1 = (value - lag(value))/lag(value)) %>% 
      ungroup()
    

    If every id has different dates, it would be safer to convert to Date class and create a sequence of monthly dates for each id:

    df %>%
      mutate(date = as.Date(paste0(date, "01"), format = '%Y%m%d')) %>%
      arrange(id, date) %>%
      group_by(id) %>%
      tidyr::complete(date = seq(min(date), max(date), by = 'month')) %>%
      mutate(value_var_1 = (value - lag(value))/lag(value)) %>% 
      ungroup()
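
The question also asks about filling the missing dates with the last value. The rows added by complete have NA in value, so one possible approach is to carry the last observed value forward within each id with tidyr::fill before computing the ratio. This is a minimal sketch, assuming that carrying the previous value forward is the behaviour wanted; the name `df_filled` is only illustrative, and the explicit arrange() is there so that lag() and fill() see the periods in order.

```r
library(dplyr)
library(tidyr)

# Second data frame from the question: id "a" has no row for 202002
df <- data.frame(
  date = c(202001L, 202003L, 202001L, 202002L, 202003L),
  id = c("a", "a", "b", "b", "b"),
  value = c(23L, 123L, 56L, 23L, 13L)
)

df_filled <- df %>%
  complete(id, date = unique(date)) %>%   # add the missing id/date rows (value = NA)
  arrange(id, date) %>%                   # make sure periods are in order within each id
  group_by(id) %>%
  fill(value, .direction = "down") %>%    # carry the last observed value forward per id
  mutate(value_var_1 = (value - lag(value)) / lag(value)) %>%
  ungroup()

df_filled
```

Note that with this approach the filled period gets a ratio of 0 (its value equals the previous one), and the next period is compared against the carried-forward value. If an NA ratio is preferable for periods without real data, skip the fill() step.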