Impute values in grouped data by condition in R


I want to impute a variable in grouped panel data using tidyverse logic. The background: this is survey data, and in particular years (time) people are asked about their behavior over the preceding years. So when someone says "I had a car for 5 years", I assume the car variable can be set to 1 for those years, even though the question was not asked in those years. Below is a minimal dataset together with the imputation I want to achieve (car_imp_goal).

paneldata = data.frame(id=c(rep(1,10),rep(2,10)), 
                       time=rep(1:10, times=2),
                       car=c(1,NA,NA,NA,NA,0,NA,NA,NA,1,1,NA,NA,NA,1,NA,NA,NA,NA,1),
                       car_imp_goal=c(1,NA,NA,NA,NA,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
paneldata

Here is what I tried

library(tidyverse)  # provides %>%, mutate(), group_by() and fill()
paneldata <- paneldata %>% mutate(car_imp_trial = car)
paneldata %>% group_by(id) %>% fill(car_imp_trial, .direction = "up")


# A tibble: 20 × 5
# Groups:   id [2]
      id  time   car car_imp_goal car_imp_trial
   <dbl> <int> <dbl>        <dbl>         <dbl>
 1     1     1     1            1             1
 2     1     2    NA           NA             0
 3     1     3    NA           NA             0
 4     1     4    NA           NA             0
 5     1     5    NA           NA             0
 6     1     6     0            0             0
 7     1     7    NA            1             1
 8     1     8    NA            1             1
 9     1     9    NA            1             1
10     1    10     1            1             1
11     2     1     1            1             1
12     2     2    NA            1             1
13     2     3    NA            1             1
14     2     4    NA            1             1
15     2     5     1            1             1
16     2     6    NA            1             1
17     2     7    NA            1             1
18     2     8    NA            1             1
19     2     9    NA            1             1
20     2    10     1            1             1

The past-behavior question is only asked in specific years (e.g. time 5 and 10). I think I need to group_by(id), then use an ifelse condition to select the relevant times (5 or 10), and then apply fill. What is wrong with car_imp_trial is that it also back-filled the 0 from year 6, which is not an answer to a past-behavior question.


Solution

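  • Create a time-interval id (here 5-year blocks via cut_interval(), which comes from ggplot2), group by it together with id, then fill the car column upwards within each block

    paneldata %>%
      group_by(id, id2 = cut_interval(time, length = 5, labels = FALSE)) %>%
      fill(car, .direction = 'up')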
    
    # A tibble: 20 × 5
    # Groups:   id, id2 [4]
          id  time   car car_imp   id2
       <dbl> <int> <dbl>   <dbl> <int>
     1     1     1     1       1     1
     2     1     2    NA      NA     1
     3     1     3    NA      NA     1
     4     1     4    NA      NA     1
     5     1     5    NA      NA     1
     6     1     6     0       0     2
     7     1     7     1       1     2
     8     1     8     1       1     2
     9     1     9     1       1     2
    10     1    10     1       1     2
    11     2     1     1       1     1
    12     2     2     1       1     1
    13     2     3     1       1     1
    14     2     4     1       1     1
    15     2     5     1       1     1
    16     2     6     1       1     2
    17     2     7     1       1     2
    18     2     8     1       1     2
    19     2     9     1       1     2
    20     2    10     1       1     2
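
The question also asks that only answers given in the survey years be carried backwards, not values observed in other years. A minimal sketch of that stricter variant follows. It is not part of the answer above, and it assumes the retrospective question is asked at time 5 and 10 (taken from the question's example) and that rows are ordered by time within id. The names survey_years, block, retro and car_imp are made up for illustration. It uses only dplyr, tidyr and base R, so ggplot2 is not needed.

```r
library(dplyr)
library(tidyr)

# Example data from the question
paneldata <- data.frame(
  id   = rep(1:2, each = 10),
  time = rep(1:10, times = 2),
  car  = c(1, NA, NA, NA, NA, 0, NA, NA, NA, 1,
           1, NA, NA, NA, 1, NA, NA, NA, NA, 1),
  car_imp_goal = c(1, NA, NA, NA, NA, 0, 1, 1, 1, 1,
                   1, 1, 1, 1, 1, 1, 1, 1, 1, 1)
)

# Assumption: years in which the past-behavior question was asked
survey_years <- c(5, 10)

imputed <- paneldata %>%
  arrange(id, time) %>%
  # Block 0 covers time 1-5, block 1 covers time 6-10; each block ends in a survey year
  mutate(block = findInterval(time, survey_years, left.open = TRUE)) %>%
  group_by(id, block) %>%
  # Keep only the answers given in survey years, then carry them backwards
  mutate(retro = if_else(time %in% survey_years, car, NA_real_)) %>%
  fill(retro, .direction = "up") %>%
  ungroup() %>%
  # Observed values win; the back-filled answer only fills the gaps
  mutate(car_imp = coalesce(car, retro)) %>%
  select(-block, -retro)

# Compare with the target column from the question
all.equal(imputed$car_imp, imputed$car_imp_goal)
```

Writing the result to a new column keeps the original car values available for comparison, and the 0 observed at time 6 is preserved because it is never used as a source for back-filling.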