I am dealing with NAs in a panel. I have five countries and the same variable is measured across five years
policy.1 <- data.frame(country= c("A","B","C","D","E"),
year1=c(2,4,NA,3,5),
year2=c(3,NA,NA,4,6),
year3=c(4,8,2,NA,7),
year4=c(5,10,4,NA,8),
year5=c(6,12,1,NA,9))
Note how country B only has one NA but C and D 2 and 3 in a row, respectively.
My imputation strategy is to compute the mean of the value before and after the NA.
In the case of country B--with a single NA--this should be the mean between 4 and 8.
Countries C, D, and E are trickier.
C has the first two values missing. I want to populate the missing values with the first available value in the row.
D is the reverse: The last three values are missing. I want to populate missing values with the last available one
E has three missing middle values. I want to have the missing populated with the mean right before and after the stretch of NAs in the row. In other words, the three missing values should be replaced with the mean of (5+9)/2=7.
This is what the output should look like:
policy.2 <- data.frame(country= c("A","B","C","D","E"),
year1=c(2,4,2,3,5),
year2=c(3,6,2,4,7),
year3=c(4,8,2,4,7),
year4=c(5,10,4,4,7),
year5=c(6,12,1,4,9))
Thank you
library(tidyr)
library(dplyr)
policy.1 |>
pivot_longer(starts_with("year")) |>
group_by(country) |>
mutate(value = ifelse(is.na(value),
(lead(value) + lag(value)) / 2,
value),,
value = ifelse(is.na(value),
(first(value) + last(value)) / 2,
value)
) |>
fill(value, .direction = "downup") |>
pivot_wider()
These actions are easier to perform in long format. So first we pivot the data and then group by country
. The order of the ifelse
is important.
NA
it returns NA
, which is what we had to start. If both values are present the mean is computed. This is done with country B.NA
values that remain, we fill them if the mean of the first
and last
values. If either of those are NA
it returns NA
. If both values are present the mean is computed. This is done with country E.NA
carrying values down and then back up. This is done for countries C and D.Lastly, we pivot the data set back to wide format.
Output
country year1 year2 year3 year4 year5
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 2 3 4 5 6
2 B 4 6 8 10 12
3 C 2 2 2 4 1
4 D 3 4 4 4 4
5 E 5 7 7 7 9
Data
policy.1 <- data.frame(country= c("A","B","C","D","E"),
year1=c(2,4,NA,3,5),
year2=c(3,NA,NA,4,NA),
year3=c(4,8,2,NA,NA),
year4=c(5,10,4,NA,NA),
year5=c(6,12,1,NA,9))