Tags: r, tidymodels, r-recipes

Tidymodels: What is the correct way to impute missing values in a Date column?


I am struggling with missing values in a Date column. In my pre-processing pipeline (a recipe object) I used the step_impute_knn function to fill in missing values in all my Date columns. Unfortunately, I got the following error:

Error: Assigned data `pred_vals` must be compatible with existing data.
ℹ Error occurred for column `avg_begin_first_contract`.
x Can't convert <double> to <date>.

Here is a reprex in which I impute values in multiple columns, including a Date column. Imputing only the Date column made no difference; the result was the same. Further below is a second reprex that does not throw an error, because it contains no Date column.

Has anyone had this issue before?

library(tidyverse)
library(tidymodels)

iris <- iris %>%
  mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
    by = "day"
  ), size = 150))

iris[45, 2] <- as.numeric(NA)
iris[37, 3] <- as.numeric(NA)
iris[78, 4] <- as.numeric(NA)
iris[9, 5] <- as.numeric(NA)
iris[15, 6] <- as.factor(NA)

set.seed(456)

iris_split <- iris %>%
  initial_split(strata = Sepal.Length)


iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 10,
  min_n = 10,
  trees = 500
) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
  data = iris_training
) %>%
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species, Plucked) %>%
  step_date(Plucked) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>%
  add_model(iris_rf_model) %>%
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(iris_split)
#> x train/test split: preprocessor 1/1: Error: Assigned data `pred_vals` must be compatible wi...
#> Warning: All models failed. See the `.notes` column.
Created on 2021-06-15 by the reprex package (v2.0.0)

Here is the reprex that does not throw an error:

library(tidyverse)
library(tidymodels)

iris[45, 2] <- as.numeric(NA)
iris[37 ,3] <- as.numeric(NA)
iris[78, 4] <- as.numeric(NA)
iris[9, 5] <- as.numeric(NA)

set.seed(123)

iris_split <- iris %>% 
  initial_split(strata = Sepal.Length)

iris_training <- training(iris_split)
iris_testing <- testing(iris_split)

iris_rf_model <- rand_forest(
  mtry = 5,
  min_n = 5,
  trees = 500) %>%
  set_engine("ranger") %>%
  set_mode("regression")


base_rec <- recipe(Sepal.Length ~ .,
                   data = iris_training) %>% 
  step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species) %>%
  step_dummy(Species)

iris_workflow <- workflow() %>% 
  add_model(iris_rf_model) %>% 
  add_recipe(base_rec)

iris_rf_wkfl_fit <- iris_workflow %>%
  last_fit(split = iris_split)
Created on 2021-06-15 by the reprex package (v2.0.0)

Thanks in advance! M.


Solution

  • I think I found an answer and want to share it. The key was to convert the Date column to a numeric value before imputing; after that, the imputation (here with step_impute_bag for the former Date column) was straightforward. Here is a reprex:

    library(tidyverse)
    library(tidymodels)
    
    iris <- iris %>%
      mutate(Plucked = sample(seq(as.Date("1999/01/01"), as.Date("2000/01/01"),
        by = "day"
      ), size = 150))
    
    iris[45, 2] <- as.numeric(NA)
    iris[37, 3] <- as.numeric(NA)
    iris[78, 4] <- as.numeric(NA)
    iris[9, 5] <- as.numeric(NA)
    iris[15, 6] <- as.factor(NA)
    
    set.seed(456)
    
    iris_split <- iris %>%
      initial_split(strata = Sepal.Length)
    
    
    iris_training <- training(iris_split)
    iris_testing <- testing(iris_split)
    
    iris_rf_model <- rand_forest(
      mtry = 10,
      min_n = 10,
      trees = 500
    ) %>%
      set_engine("ranger") %>%
      set_mode("regression")
    
    
    base_rec <- recipe(Sepal.Length ~ .,
      data = iris_training
    ) %>% 
      step_mutate_at(
        where(lubridate::is.Date),
        fn = ~ as.numeric(lubridate::ymd(.x))
      ) %>%
      step_impute_bag(c("Plucked")) %>% 
      step_impute_knn(Sepal.Width, Petal.Length, Petal.Width, Species) %>%
      step_dummy(Species)
    
    iris_workflow <- workflow() %>%
      add_model(iris_rf_model) %>%
      add_recipe(base_rec)
    
    iris_rf_wkfl_fit <- iris_workflow %>%
      last_fit(iris_split)
    #> ! train/test split: preprocessor 1/1, model 1/1: 10 columns were requested but there were 6 ...
    Created on 2021-06-29 by the reprex package (v2.0.0)
    

    To convert the numeric values back to Dates before fitting, add the following step to the recipe:

    step_mutate_at(c("Plucked"), fn = ~ as.Date(.x, origin = "1970-01-01 UTC"))
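
    The round trip between Date and numeric can also be checked outside of recipes. The following is only a minimal base-R sketch: the vector `plucked` is made up for illustration, and the median fill is a simple stand-in for an imputer, not what step_impute_bag or step_impute_knn do internally.

    ```r
    # Hypothetical example data: a Date vector with one missing value.
    plucked <- as.Date(c("1999-03-01", NA, "1999-03-11"))

    # A Date is stored as the number of days since 1970-01-01,
    # so as.numeric() gives a plain double vector (NA stays NA).
    plucked_num <- as.numeric(plucked)

    # Stand-in imputation (assumption: any numeric imputer could go here):
    # fill the gap with the median of the observed values.
    plucked_num[is.na(plucked_num)] <- median(plucked_num, na.rm = TRUE)

    # Convert back to Date using the same origin as the numeric encoding.
    plucked_back <- as.Date(plucked_num, origin = "1970-01-01")

    print(plucked_back)
    #> [1] "1999-03-01" "1999-03-06" "1999-03-11"
    ```

    Note that a model-based imputer may return fractional day counts; as.Date accepts them, but the printed date only shows the whole-day part.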
    

    Thanks again, M.