r, linear-regression, na

Filling NA using linear regression in R


I have a dataset with one time column and two variables (example below).

df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 
                              25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 
                              120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85, 
                              130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time", 
                              "var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df", 
                              "tbl", "data.frame"))

var1 has a few NA values, and I want to fill them using a linear regression of var1 on var2, fitted on the rows where var1 is not missing.

Please help, and let me know if you need more information.


Solution

  • Here is an example that uses lm() to fit the model on the non-NA rows and predict() to fill in the missing values in R.

    library(dplyr)
    
    # Construct linear model based on non-NA pairs
    df2 <- df %>% filter(!is.na(var1))
    
    fit <- lm(var1 ~ var2, data = df2)
    
    # See the result
    summary(fit)
    
    # Call:
    # lm(formula = var1 ~ var2, data = df2)
    #
    # Residuals:
    #          1          2          3          4          5          6
    #  8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15
    #
    # Coefficients:
    #              Estimate Std. Error   t value Pr(>|t|)
    # (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *
    # var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
    # ---
    # Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    #
    # Residual standard error: 7.246e-15 on 4 degrees of freedom
    # Multiple R-squared:      1,   Adjusted R-squared:      1
    # F-statistic: 2.284e+32 on 1 and 4 DF,  p-value: < 2.2e-16
    #
    # Warning message:
    # In summary.lm(fit) : essentially perfect fit: summary may be unreliable
    
    # Use fit to predict the value
    df3 <- df %>% 
      mutate(pred = predict(fit, .)) %>%
      # Replace NA with pred in var1
      mutate(var1 = ifelse(is.na(var1), pred, var1))
    
    # See the result
    df3 %>% as.data.frame()
    
    #    time  var1   var2  pred
    # 1    15  20.4  30.60  20.4
    # 2    16  31.5  47.25  31.5
    # 3    17  42.6  63.90  42.6
    # 4    18  53.7  80.55  53.7
    # 5    19  64.8  97.20  64.8
    # 6    20  75.9 113.85  75.9
    # 7    21  87.0 130.50  87.0
    # 8    22  98.1 147.15  98.1
    # 9    23 109.2 163.80 109.2
    # 10   24 120.3 180.45 120.3
    # 11   25 131.4 197.10 131.4
    # 12   26 142.5 213.75 142.5
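
If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only a minimal variant of the approach above, not a different method. It assumes the default `na.action` (`na.omit`), so `lm()` drops the rows where var1 is NA by itself, and it rebuilds the question's data as a plain data.frame. The helper name `missing` is just illustrative.

```r
# Same example data as in the question, as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# With the default na.action (na.omit), lm() fits only on rows
# where both var1 and var2 are present
fit <- lm(var1 ~ var2, data = df)

# Logical index of the rows that need filling (illustrative name)
missing <- is.na(df$var1)

# Predict only for those rows and write the values back into var1
df$var1[missing] <- predict(fit, newdata = df[missing, ])

df
```

Unlike the dplyr version, this does not keep a separate pred column. It overwrites only the missing entries of var1 and leaves the observed values untouched.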