Search code examples
rdataframepanelnamissing-data

NAs in longitudinal/panel data


I am dealing with NAs in a panel. I have five countries and the same variable is measured across five years

policy.1 <- data.frame(country= c("A","B","C","D","E"),
                       year1=c(2,4,NA,3,5),
                       year2=c(3,NA,NA,4,6),
                       year3=c(4,8,2,NA,7),
                       year4=c(5,10,4,NA,8),
                       year5=c(6,12,1,NA,9))

Note how country B only has one NA but C and D 2 and 3 in a row, respectively.

My imputation strategy is to compute the mean of the value before and after the NA.

In the case of country B--with a single NA--this should be the mean between 4 and 8.

Countries C, D, and E are trickier.

C has the first two values missing. I want to populate the missing values with the first available value in the row.

D is the reverse: The last three values are missing. I want to populate missing values with the last available one

E has three missing middle values. I want to have the missing populated with the mean right before and after the stretch of NAs in the row. In other words, the three missing values should be replaced with the mean of (5+9)/2=7.

This is what the output should look like:

policy.2 <- data.frame(country= c("A","B","C","D","E"),
                       year1=c(2,4,2,3,5),
                       year2=c(3,6,2,4,7),
                       year3=c(4,8,2,4,7),
                       year4=c(5,10,4,4,7),
                       year5=c(6,12,1,4,9))

Thank you


Solution

  • library(tidyr)
    library(dplyr)
    
    policy.1 |> 
      pivot_longer(starts_with("year")) |>
      group_by(country) |>
      mutate(value = ifelse(is.na(value), 
                           (lead(value) + lag(value)) / 2,
                           value),,
             value = ifelse(is.na(value),
                            (first(value) + last(value)) / 2,
                            value)
             ) |> 
      fill(value, .direction = "downup") |>
      pivot_wider()
    

    These actions are easier to perform in long format. So first we pivot the data and then group by country. The order of the ifelse is important.

    1. First we try computing the mean with the surrounding values. If any of values before or after are NA it returns NA, which is what we had to start. If both values are present the mean is computed. This is done with country B.
    2. For any NA values that remain, we fill them if the mean of the first and last values. If either of those are NA it returns NA. If both values are present the mean is computed. This is done with country E.
    3. Lastly we fill NA carrying values down and then back up. This is done for countries C and D.

    Lastly, we pivot the data set back to wide format.

    Output

      country year1 year2 year3 year4 year5
      <chr>   <dbl> <dbl> <dbl> <dbl> <dbl>
    1 A           2     3     4     5     6
    2 B           4     6     8    10    12
    3 C           2     2     2     4     1
    4 D           3     4     4     4     4
    5 E           5     7     7     7     9
    

    Data

    policy.1 <- data.frame(country= c("A","B","C","D","E"),
                           year1=c(2,4,NA,3,5),
                           year2=c(3,NA,NA,4,NA),
                           year3=c(4,8,2,NA,NA),
                           year4=c(5,10,4,NA,NA),
                           year5=c(6,12,1,NA,9))