Search code examples
rdataframedplyrsum

How to use R to replace missing values with the sum of previous 4 values in a column?


I have a dataframe that contains (among other things) three columns that have missing values every 5 rows. These missing values need to be replaced with the sum of the previous 4 values in their respective column.

For example, let's say my dataframe looked like this:

id        category1 category2 category3
123         5        10        10
123         6        11        15
123         6        12        23
123         4        10         6
123        NA        NA        NA
567        24        17        15

Those NAs need to represent a "total" based on the sum of the previous 4 values in their column, and this needs to repeat throughout the entire dataframe because the NAs occur every 5 rows. For instance, the three NAs in the mock example above should be replaced with 21, 43, and 54. 5 rows later, the same process will need to be repeated. How can I achieve this?


Solution

  • Another possible solution:

    library(dplyr)
    
    df %>% 
      group_by(id) %>% 
      mutate(across(everything(), ~ if_else(is.na(.x), sum(.x, na.rm = T), .x))) %>% 
      ungroup
    
    #> # A tibble: 6 × 4
    #>      id category1 category2 category3
    #>   <int>     <int>     <int>     <int>
    #> 1   123         5        10        10
    #> 2   123         6        11        15
    #> 3   123         6        12        23
    #> 4   123         4        10         6
    #> 5   123        21        43        54
    #> 6   567        24        17        15