Search code examples
rdplyrgroupingmultiple-columnsimputation

Imputing missing values in a grouped dataframe


I am imputing missing values within a grouped dataframe. Inside DF the missing values are randomized for Var1 & Var2.

The dataframe is grouped by variables Factory:MachineNum. The imputation is done by the order of Odometer within these groupings.

The code works perfectly about 5-10% of the time. The other 90-95% of the time it says;

"Error: Column Impute must be length 50 (the group size) or one, not 49".

I think it may have something to do with the randomness of the missing values. Perhaps when at least 1 row shares 2 missing values.

How can this code be made more robust?

By running the entire code multiple times you will see it works about 5 - 10% of attempts, and the Results dataframe will eventually be produced.

library(dplyr)
library(tidyr)

# Create dataframe with some missing values in Var1 and Var2
DF <- data.frame(Factory = c(replicate(150,"Factory_A"), replicate(150,"Factory_B")),
                 MachineNum = c(replicate(100,"Machine01"), replicate(100,"Machine02"), replicate(100,"Machine03")),
                 Odometer = c(replicate(1,sample(1:1000,100,rep=FALSE)), replicate(1,sample(5000:7000,100,rep=FALSE)), replicate(1,sample(10000:11500,100,rep=FALSE))),
                 Var1 =c(replicate(1, sample(c(2:10, NA), 100, rep = TRUE)), replicate(1, sample(c(15:20, NA), 100, rep = TRUE)), replicate(1, sample(c(18:24, NA), 100, rep = TRUE))),
                 Var2 = c(replicate(1, sample(c(110:130, NA), 100, rep = TRUE)), replicate(1, sample(c(160:170, NA), 100, rep = TRUE)), replicate(1, sample(c(220:230, NA), 100, rep = TRUE)))
)




# Variables with missing values that need imputing
cols <- grep('Var', names(DF), value = TRUE)

# Group-wise impution of missing values
library(stinepack)
Models <- DF %>%
  pivot_longer(cols = starts_with('Var')) %>%
  arrange(Factory, MachineNum, name, Odometer) %>%
  group_by(Factory, MachineNum, name) %>%
  mutate(Impute = na.stinterp(value, along = time(Odometer), na.rm = TRUE))



# Convert results from long to wide to visually inspect
Results <- Models %>% 
  group_by(Factory, MachineNum, name) %>% 
  mutate(row = row_number()) %>% 
  tidyr::pivot_wider(names_from = name, values_from = c(value, Impute))

Solution

  • The erorr happens when you have leading and trailing NA's in a group and since you have na.rm = TRUE it removes them making the group unbalanced.

    If you keep na.rm as FALSE it would keep NA's as NA and run without error.

    library(dplyr)
    library(stinepack)
    
    DF %>%
      pivot_longer(cols = starts_with('Var')) %>%
      arrange(Factory, MachineNum, name, Odometer) %>%
      group_by(Factory, MachineNum, name) %>%
      mutate(Impute = na.stinterp(value, along = time(Odometer), na.rm = FALSE))