Search code examples
rimputationlocf

Impute missing values with a value from previous month (if exists)


I have a dataframe with more than 100 000 rows and 30 000 unique ids.

My aim is to fill all the NAs among the different columns if there is a value from the previous month and the same id. However, most of the times the previous recorded value is from more than a month ago. Those NAs I would like to leave untouched.

The id column and the date column do not have NAs. Here is an example of the data I have:

df3

id    oxygen      gluco        dias         bp        date
1   0,25897842  0,20201604  0,17955655  0,14100962  31.7.2019
2   NA          NA          0,38582622  0,12918231  31.12.2014
2   0,35817147  0,32943499  NA          0,43667462  30.11.2018
2   0,68557053  0,42898807  0,93897514  NA          31.10.2018
2   NA          NA          0,99899076  0,44168223  31.7.2018
2   0,43848054  0,38604586  NA          NA          30.4.2013
2   0,15823254  0,06216771  0,07829624  0,69755251  31.1.2016
2   NA          NA          0,61645303  NA          29.2.2016
2   0,94671363  0,50682091  0,96770222  0,97403356  31.5.2018
3   NA          0,77352235  0,660479    0,11554399  30.4.2019
3   0,15567703  NA          0,4553325   NA          31.3.2017
3   NA          NA          0,22181609  0,08527658  30.9.2017
3   0,93660763  NA          NA          NA          31.3.2018
3   0,73416759  NA          NA          0,78501791  30.11.2018
3   NA          NA          NA          NA          28.2.2019
3   0,84525106  0,54360374  NA          0,40595426  31.8.2014
3   0,76221263  0,62983336  0,84592719  0,10640734  31.8.2013
4   NA          0,29108942  0,3863479   NA          31.1.2018
4   0,74075742  NA          0,38117415  0,58849266  30.11.2018
4   0,09400641  0,68860814  NA          0,88895224  31.8.2014
4   0,72202944  0,49901387  0,19967415  NA          31.8.2018
4   0,98205262  0,85213969  0,34450998  0,98962306  30.11.2013


This is the last code implementation that I have tried:

´´´

df3 %>%
    group_by(id) %>%
    mutate_all(funs(na.locf(., na.rm = FALSE, maxgap = 30)))

´´´

But apparently "mutate_all() ignored the following grouping variables: Column id"


Solution

  • You can use the tidyverse for that. Here's an approach:

    • Change the date column to class Date, then order by date
    • Prepare the dates and remove the days in Ym
    • get the time difference in mo
    • flag the rows which have max one month difference
    • get groups by cumsum the inverse logic in flag
    • fill the rows from the same groups
    library(dplyr)
    library(tidyr)
    library(lubridate)
    
    df$date <- as.Date(df$date, format="%d.%m.%Y")
    
    df %>% 
      arrange(date) %>%
      mutate(
         Ym = ym(strftime(date, "%Y-%m")), 
         mo = interval(Ym, lag(Ym, default=as.Date("1970-01-01"))) / months(1), 
         flag = cumsum(!(mo > -2 & mo < 1))) %>% 
      group_by(id, flag) %>% 
      fill(names(.), .direction="down") %>% 
      ungroup() %>% 
      select(-c("Ym","mo","flag")) %>% 
      print(n=nrow(.))
    

    Output

    # A tibble: 22 × 6
          id oxygen     gluco      dias       bp         date      
       <int> <chr>      <chr>      <chr>      <chr>      <date>    
     1     2 0,43848054 0,38604586 NA         NA         2013-04-30
     2     3 0,76221263 0,62983336 0,84592719 0,10640734 2013-08-31
     3     4 0,98205262 0,85213969 0,34450998 0,98962306 2013-11-30
     4     3 0,84525106 0,54360374 NA         0,40595426 2014-08-31
     5     4 0,09400641 0,68860814 NA         0,88895224 2014-08-31
     6     2 NA         NA         0,38582622 0,12918231 2014-12-31
     7     2 0,15823254 0,06216771 0,07829624 0,69755251 2016-01-31
     8     2 0,15823254 0,06216771 0,61645303 0,69755251 2016-02-29
     9     3 0,15567703 NA         0,4553325  NA         2017-03-31
    10     3 NA         NA         0,22181609 0,08527658 2017-09-30
    11     4 NA         0,29108942 0,3863479  NA         2018-01-31
    12     3 0,93660763 NA         NA         NA         2018-03-31
    13     2 0,94671363 0,50682091 0,96770222 0,97403356 2018-05-31
    14     2 NA         NA         0,99899076 0,44168223 2018-07-31
    15     4 0,72202944 0,49901387 0,19967415 NA         2018-08-31
    16     2 0,68557053 0,42898807 0,93897514 NA         2018-10-31
    17     2 0,35817147 0,32943499 0,93897514 0,43667462 2018-11-30
    18     3 0,73416759 NA         NA         0,78501791 2018-11-30
    19     4 0,74075742 NA         0,38117415 0,58849266 2018-11-30
    20     3 NA         NA         NA         NA         2019-02-28
    21     3 NA         0,77352235 0,660479   0,11554399 2019-04-30
    22     1 0,25897842 0,20201604 0,17955655 0,14100962 2019-07-31
    

    Data

    df <- structure(list(id = c(1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L), oxygen = c("0,25897842", 
    NA, "0,35817147", "0,68557053", NA, "0,43848054", "0,15823254", 
    NA, "0,94671363", NA, "0,15567703", NA, "0,93660763", "0,73416759", 
    NA, "0,84525106", "0,76221263", NA, "0,74075742", "0,09400641", 
    "0,72202944", "0,98205262"), gluco = c("0,20201604", NA, "0,32943499", 
    "0,42898807", NA, "0,38604586", "0,06216771", NA, "0,50682091", 
    "0,77352235", NA, NA, NA, NA, NA, "0,54360374", "0,62983336", 
    "0,29108942", NA, "0,68860814", "0,49901387", "0,85213969"), 
        dias = c("0,17955655", "0,38582622", NA, "0,93897514", "0,99899076", 
        NA, "0,07829624", "0,61645303", "0,96770222", "0,660479", 
        "0,4553325", "0,22181609", NA, NA, NA, NA, "0,84592719", 
        "0,3863479", "0,38117415", NA, "0,19967415", "0,34450998"
        ), bp = c("0,14100962", "0,12918231", "0,43667462", NA, "0,44168223", 
        NA, "0,69755251", NA, "0,97403356", "0,11554399", NA, "0,08527658", 
        NA, "0,78501791", NA, "0,40595426", "0,10640734", NA, "0,58849266", 
        "0,88895224", NA, "0,98962306"), date = structure(c(18108, 
        16435, 17865, 17835, 17743, 15825, 16831, 16860, 17682, 18016, 
        17256, 17439, 17621, 17865, 17955, 16313, 15948, 17562, 17865, 
        16313, 17774, 16039), class = "Date")), row.names = c(NA, 
    -22L), class = "data.frame")