Search code examples
rdplyrinterpolationlag

Interpolating missing data in a dataframe with R


I have a dataframe which is similar to the one below:

Country Ccode Year Happiness Power   
1  France    FR 2000      1000  1000  
2  France    FR 2001        NA    NA
3  France    FR 2002        NA    NA
4  France    FR 2003      1600  2200
5  France    FR 2004        NA    NA
6      UK    UK 2000      1000  1000  
7      UK    UK 2001        NA    NA
8      UK    UK 2002      1000  1000  
9      UK    UK 2003      1000  1000
10     UK    UK 2004      1000  1000 

I have previously used the following code to get the differences:

df <- df %>%
  arrange(country, year) %>%  #sort data
  group_by(country) %>%
  mutate_if(is.numeric, funs(d = . - lag(.)))

I would like expand on this code by calculating the difference between the data points of Happiness and Power, divide it by the difference in years between the data points and calculate the values to replace the NA's with, resulting in the following output.

Country Ccode Year Happiness Power   
1  France    FR 2000      1000  1000  
2  France    FR 2001      1200  1400    
3  France    FR 2002      1400  1800
4  France    FR 2003      1600  2200
5  France    FR 2004        NA    NA
6      UK    UK 2000      1000  1000  
7      UK    UK 2001        0      0
8      UK    UK 2002      1000  1000  
9      UK    UK 2003      1000  1000
10     UK    UK 2004      1000  1000  

What would be an efficient way of carrying out this task?

EDIT: Please note that also France 2004 is NA. The extend function does seem to properly deal with such a situation.

EDIT 2: Adding the group_by(country) seems to mess things up for unknown reasons:It seems that the code is trying to convert a character to a numeric, although I do not really understand why. When I convert the column to character, the error becomes an evaluation error. Any suggestions?

> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(TRcomplete$F116.s)
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.numeric(as.character(TRcomplete$F116.s))
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Column `F116.s` can't be converted from character to numeric
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s))
Error: unexpected ')' in "TRcomplete$F116.s <- as.character(TRcomplete$F116.s))"
> TRcomplete$F116.s <- as.character(TRcomplete$F116.s)
> str(TRcomplete$F116.s)
 chr [1:6984] NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA ...
> TRcomplete<-TRcomplete%>%
+     group_by(country) %>%
+     mutate_at(70:73,~na.fill(.x,"extend"))
Error in mutate_impl(.data, dots) : 
  Evaluation error: need at least two non-NA values to interpolate.

Solution

  • You can use na.fill with fill="extend" from the zoo library

    rapply(df, zoo::na.fill,"integer",fill="extend",how="replace")
      Country Ccode Year Happiness Power
    1  France    FR 2000      1000  1000
    2  France    FR 2001      1200  1400
    3  France    FR 2003      1400  1800
    4  France    FR 2004      1600  2200
    5      UK    UK 2000      1000  1000
    6      UK    UK 2001      1000  1000
    7      UK    UK 2003      1000  1000
    8      UK    UK 2004      1000  1000
    

    EDIT:

    library(tidyverse)
    library(zoo)
    df%>%
      group_by(Country)%>%
      mutate_at(4:5,~na.fill(.x,"extend"))
    
      Country Ccode Year Happiness Power
    1  France    FR 2000      1000  1000
    2  France    FR 2001      1200  1400
    3  France    FR 2003      1400  1800
    4  France    FR 2004      1600  2200
    5      UK    UK 2000      1000  1000
    6      UK    UK 2001      1000  1000
    7      UK    UK 2003      1000  1000
    8      UK    UK 2004      1000  1000
    

    If all the elements in the group are NA then:

    df%>% 
      group_by(Country)%>% 
      mutate_if(is.numeric,~if(all(is.na(.x))) NA else na.fill(.x,"extend"))