Search code examples
rdatefill

How to fill missing values grouped on id and based on time period from index date


I want to fill in missing values for a data.frame based on a period of time within groups of ID.

For the latest registration_dat within the same ID group, I want to fill in with previous values in the ID group but only if the registration_dat is within 1 year of the latest registration_dat in the ID group.

Sample version of my data:

ID registration_dat  value1      value2
1  2020-03-04          NA          NA
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          NA          NA
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          NA          NA
8  2019-10-31          NA          43
8  2019-08-12          32          46

Desired output:

ID registration_dat  value1      value2
1  2020-03-04          33          25
1  2019-05-06          33          25
1  2019-01-02          32          21
3  2021-10-31          NA          NA
3  2018-10-12          33          NA
3  2018-10-10          25          35
4  2020-01-02          32          83
4  2019-10-31          32          83
4  2019-09-20          33          56
8  2019-12-12          32          43
8  2019-10-31          NA          43
8  2019-08-12          32          46

I am later filtering the data so that i get one unique ID based on the latest registration date and I want this row to have as little missing data as possible hence I want to do this for all columns in the dataframe. However I do not want NA values being filled in by values in previous dates if its more than 1 year apart from the latest registration date. My dataframe has 14 columns and 3 million+ rows so I would need it to work on a much bigger data.frame than the one shown as an example.

I'd appreciate any ideas!


Solution

  • You could make a small function (f, below) to handle each value column.

    1. Make a grouped ID, and generate a rowid (this is only to retain your original order)
    dat <- dat %>% 
      mutate(rowid = row_number()) %>% 
      arrange(registration_dat) %>% 
      group_by(ID)
    
    1. Make a function that takes a df and val column, and returns and updated df with val fixed
    f <- function(df, val) {
      bind_rows(
        df %>% filter(is.na({{val}}) & row_number()!=n()),
        df %>% filter(!is.na({{val}}) | row_number()==n()) %>% 
          mutate({{val}} := if_else(is.na({{val}}) & registration_dat-lag(registration_dat)<365, lag({{val}}),{{val}}))
      )
    }
    
    1. Apply the function to the columns of interest
    dat = f(dat,value1)
    dat = f(dat,value2)
    
    1. If you want, recover the original order
    dat %>% arrange(rowid) %>% select(-rowid)
    

    Output:

          ID registration_dat value1 value2
       <int> <date>            <int>  <int>
     1     1 2020-03-04           33     25
     2     1 2019-05-06           33     25
     3     1 2019-01-02           32     21
     4     3 2021-10-31           NA     NA
     5     3 2018-10-12           33     NA
     6     3 2018-10-10           25     35
     7     4 2020-01-02           32     83
     8     4 2019-10-31           32     83
     9     4 2019-09-20           33     56
    10     8 2019-12-12           32     46
    11     8 2019-10-31           NA     43
    12     8 2019-08-12           32     46
    

    Update:

    The OP wants the final row (i.e the last registration_dat) per ID. With 3 million rows and 14 value columns, I would use data.table and do something like this:

    library(data.table)
    
    f <- function(df) {
      df = df[df[1,registration_dat]-registration_dat<=365]
      df[1,value:=df[2:.N][!is.na(value)][1,value]][1]
    }
    
    dcast(
      melt(setDT(dat), id=c("ID", "registration_dat"))[order(-registration_dat),f(.SD), by=.(ID,variable)],
      ID+registration_dat~variable, value.var="value"
    )
    

    Output:

          ID registration_dat value1 value2
       <int>           <Date>  <int>  <int>
    1:     1       2020-03-04     33     25
    2:     3       2021-10-31     NA     NA
    3:     4       2020-01-02     32     83
    4:     8       2019-12-12     32     43