I have a data set with one time column and two variables (example below).
df <- structure(list(time = c(15, 16, 17, 18, 19, 20, 21, 22, 23, 24,
25, 26), var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA,
120.3, NA, 142.5), var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
130.5, 147.15, 163.8, 180.45, 197.1, 213.75)), .Names = c("time",
"var1", "var2"), row.names = c(NA, -12L), class = c("tbl_df",
"tbl", "data.frame"))
var1 has a few NA values, and I want to fill them using a linear regression of var1 on var2, fitted to the rows where var1 is not missing.
Please help, and let me know if you need more information.
Here is an example using lm() to fit the model and predict() to fill in the missing values in R.
library(dplyr)
# Construct linear model based on non-NA pairs
df2 <- df %>% filter(!is.na(var1))
fit <- lm(var1 ~ var2, data = df2)
# See the result
summary(fit)
# Call:
# lm(formula = var1 ~ var2, data = df2)
#
# Residuals:
#          1          2          3          4          5          6
#  8.627e-15 -2.388e-15  1.546e-16 -9.658e-15 -2.322e-15  5.587e-15
#
# Coefficients:
#              Estimate Std. Error   t value Pr(>|t|)
# (Intercept) 2.321e-14  5.619e-15 4.130e+00   0.0145 *
# var2        6.667e-01  4.411e-17 1.511e+16   <2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 7.246e-15 on 4 degrees of freedom
# Multiple R-squared: 1, Adjusted R-squared: 1
# F-statistic: 2.284e+32 on 1 and 4 DF, p-value: < 2.2e-16
#
# Warning message:
# In summary.lm(fit) : essentially perfect fit: summary may be unreliable
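The "essentially perfect fit" warning appears because, in this example data, every non-missing var1 is exactly two thirds of var2, so the residuals are only floating-point noise. A minimal sketch to confirm that, with the six complete pairs copied by hand from the question (the names `ratio` and `exact_fit` are only illustrative):

```r
# Non-missing (var1, var2) pairs copied from the example data
var1 <- c(20.4, 31.5, 53.7, 64.8, 120.3, 142.5)
var2 <- c(30.6, 47.25, 80.55, 97.2, 180.45, 213.75)

# Ratio of var1 to var2 for each complete pair
ratio <- var1 / var2

# TRUE when var1 equals 2/3 * var2 up to floating-point tolerance
exact_fit <- isTRUE(all.equal(var1, var2 * 2 / 3))
exact_fit
```

With real, noisy data the fit will not be exact and the warning should not appear.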
# Use fit to predict the value
df3 <- df %>%
  mutate(pred = predict(fit, .)) %>%
  # Replace NA with pred in var1
  mutate(var1 = ifelse(is.na(var1), pred, var1))
# See the result
df3 %>% as.data.frame()
# time var1 var2 pred
# 1 15 20.4 30.60 20.4
# 2 16 31.5 47.25 31.5
# 3 17 42.6 63.90 42.6
# 4 18 53.7 80.55 53.7
# 5 19 64.8 97.20 64.8
# 6 20 75.9 113.85 75.9
# 7 21 87.0 130.50 87.0
# 8 22 98.1 147.15 98.1
# 9 23 109.2 163.80 109.2
# 10 24 120.3 180.45 120.3
# 11 25 131.4 197.10 131.4
# 12 26 142.5 213.75 142.5
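If you do not need the extra pred column, the same fill can probably be done in base R alone. This is a sketch under two assumptions: the data is rebuilt here as a plain data.frame rather than a tibble, and R's default `na.action` (`na.omit`) is in effect, so `lm()` drops the rows where var1 is NA by itself. The name `missing_rows` is only illustrative.

```r
# Example data from the question, rebuilt as a plain data.frame
df <- data.frame(
  time = 15:26,
  var1 = c(20.4, 31.5, NA, 53.7, 64.8, NA, NA, NA, NA, 120.3, NA, 142.5),
  var2 = c(30.6, 47.25, 63.9, 80.55, 97.2, 113.85,
           130.5, 147.15, 163.8, 180.45, 197.1, 213.75)
)

# Fit on the full data; rows with NA in var1 are dropped by the default na.action
fit <- lm(var1 ~ var2, data = df)

# Predict only for the rows where var1 is missing and assign in place
missing_rows <- is.na(df$var1)
df$var1[missing_rows] <- predict(fit, newdata = df[missing_rows, , drop = FALSE])

df
```

Only the NA entries are overwritten, so the observed var1 values stay exactly as they were.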