Tags: r, glm

Predict missing values of the response variable based on a linear model


    datax <- matrix(1:32, nrow = 8)
    datax[2:5, 1] <- NA
    m <- data.frame(datax)
    names(m)[1:4] <- c("Length", "Width", "sex", "height")
    model <- glm(Length ~ Width + sex + height, data = m)

How do I predict the NA values based on the model? (The code above is just an example.)

I've got a dataset with 15 variables and the response variable has some missing values. How can I predict the missing values of the response variable based on a linear model built from this dataset?


Solution

  • How about splitting your data into the rows with and the rows without missing values, fitting a linear model on the complete rows, and imputing the missing values in the incomplete rows with predict()?

    library(dplyr)  # provides %>% and select()
    
    datax <- matrix(1:32, nrow = 8)
    datax[2:5,1] <- NA
    m <- data.frame(datax)
    names(m)[c(1:4)] <- c("Length", "Width", "sex", "height")
    
    # Creating an index of rows with missing values in "Length"
    missing_index <- which(is.na(m$Length))
    
    # Subsetting rows with missing values
    m_missing <- m[missing_index,]
    
    # Subsetting the rest
    m_rest <- m[-missing_index,]
    
    # Creating a linear model on m_rest and making predictions on m_missing
    model <- lm(Length ~ ., data = m_rest)
    predictions <- predict(model, newdata = m_missing %>% select(-Length))
    
    # Insert the predicted values into the original data frame
    m[missing_index, "Length"] <- predictions
    

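    Resulting in the following (the predictors in this toy data are perfectly collinear, so R may warn about a rank-deficient fit):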

    > print(m)
      Length Width sex height
    1      1     9  17     25
    2      2    10  18     26
    3      3    11  19     27
    4      4    12  20     28
    5      5    13  21     29
    6      6    14  22     30
    7      7    15  23     31
    8      8    16  24     32
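
The same idea can be sketched in base R without subsetting by hand, relying on the fact that lm() drops rows with missing values when fitting under the default na.action (na.omit). The simulated data, the coefficients used to generate it, and the name is_missing are assumptions made for illustration; they are not from the question. Simulated data is used here only so that the predictors are not perfectly collinear.

```r
# Hypothetical toy data (not from the question): Length depends on the
# three predictors plus a little noise
set.seed(42)
n <- 30
m <- data.frame(
  Width  = rnorm(n, mean = 10, sd = 2),
  sex    = rbinom(n, size = 1, prob = 0.5),
  height = rnorm(n, mean = 25, sd = 3)
)
m$Length <- 1 + 0.5 * m$Width + 2 * m$sex + 0.3 * m$height + rnorm(n, sd = 0.1)

# Knock out a few response values to mimic the missing data
m$Length[c(2, 5, 11, 20)] <- NA

# Logical flag for the rows whose response is missing
is_missing <- is.na(m$Length)

# With the default na.action (na.omit), lm() fits on the complete rows only
model <- lm(Length ~ Width + sex + height, data = m)

# Predict only for the rows with a missing response and fill them in
m$Length[is_missing] <- predict(model, newdata = m[is_missing, ])

# Check that no missing values remain in the response
anyNA(m$Length)
```

For the glm() call in the question the pattern should carry over: fit with glm() and call predict(model, newdata = ..., type = "response") to get predictions on the scale of the response. Whether model-based imputation of the response is appropriate for a real 15-variable dataset depends on why the values are missing, so treat this as a starting point rather than a recipe.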