Search code examples
rrowfillpanel-data

Fill down a range of rows in R


I have a panel dataset in R, which includes observations per group over time (month). The following dataframe is a snapshot of the complete dataframe:

df <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", NA,NA,NA, "D","G","H", "K","L", NA,NA, "DE","GH"),second_value = c(1,5,7, NA,NA,NA, 2,3,9, 7,1, NA,NA, 4,4))

The dataset is already arranged by group and time. As you can see, observations ("first_value*"* and *"*second_value") can be completely empty for a group in a given month (here February, but can be any month except the first and the last month for every group). What I want to achieve is that the empty months are filled with the last non-empty previous month within a group.

I want to get the following dataframe:

df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2),month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"),first_value = c("A","BC","D", "A","BC","D", "D","G","H", "K","L", "K","L", "DE","GH"),second_value = c(1,5,7, 1,5,7, 2,3,9, 7,1, 7,1, 4,4))

Please note that, by construction, the last non-empty previous month always has the same number of observations than the following empty months.

I tried different commands with fill() from the dplyr package and na.locf () from the zoo package but all I achieved was filling down the last row of the last non-empty previous month, so that

df_filled <- data.frame(group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), month = c("January", "January", "January", "February", "February", "February", "March", "March", "March", "January", "January", "February", "February", "March", "March"), first_value = c("A","BC","D", "D","D","D", "D","G","H", "K","L", "L","L", "DE","GH"), second_value = c(1,5,7, 7,7,7, 2,3,9, 7,1, 1,1, 4,4))

Looking forward to your suggestions. thanks.


Solution

  • An approach using row_number, assuming there are no 2 consecutive month with NA.

    library(dplyr)
    
    df %>% 
      mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>% 
      mutate(across(ends_with("_value"), ~ 
               if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group, 
             n_na = NULL)
       group    month first_value second_value
    1      1  January           A            1
    2      1  January          BC            5
    3      1  January           D            7
    4      1 February           A            1
    5      1 February          BC            5
    6      1 February           D            7
    7      1    March           D            2
    8      1    March           G            3
    9      1    March           H            9
    10     2  January           K            7
    11     2  January           L            1
    12     2 February           K            7
    13     2 February           L            1
    14     2    March          DE            4
    15     2    March          GH            4
    

    If consecutive NA months are possible with this approach it takes a bit of group juggling

    df %>% 
      mutate(n_na = sum(is.na(first_value)), .by = c(group, month)) %>% 
      mutate(n_lag = lag(n_na, default=0), .by = group) %>% 
      mutate(n_lag = n_na + first(n_lag), 
             n_na = if_else(n_na != 0, n_lag, n_na), 
             n_lag = NULL, .by = c(group, month)) %>% 
      mutate(across(ends_with("_value"), ~ 
               if_else(is.na(.x), .x[row_number() - n_na], .x)), .by = group, 
             n_na = NULL)