
Tidymodels Error: Can't rename variables in this context


I recently picked up tidymodels after using R for a few months at school.

I was trying to build my first model on the Titanic dataset from Kaggle, but ran into an error when fitting it. Could someone help me?

titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size + Name, data = titanic_train) %>%
  step_impute_knn(all_predictors(), k = 3) %>% 
  step_dummy(Sex, Pclass, Embarked, Family_Size, Name) %>% 
  step_interact(~ Sex:Age + Sex:Pclass + Pclass:Age)
  
log_model <- logistic_reg() %>% 
              set_engine("glm") %>% 
              set_mode("classification")

fitted_log_model <- workflow() %>%
                      add_model(log_model) %>%
                      add_recipe(titanic_rec) %>% 
                      fit(data = titanic_train) %>% 
                      pull_workflow_fit() %>% 
                      tidy()

Every column is a factor except Age and Survived, which are doubles. The error appears as soon as I add the fit(data = ...) call, and everything after it, to the pipeline.

Error: Can't rename variables in this context. Run `rlang::last_error()` to see where the error occurred.
24. stop(fallback)
23. signal_abort(cnd)
22. abort("Can't rename variables in this context.")
21. eval_select_recipes(to_impute, training, info)
20. impute_var_lists(to_impute = x$terms, impute_using = x$impute_with, training = training, info = info)
19. prep.step_impute_knn(x$steps[[i]], training = training, info = x$term_info)
18. prep(x$steps[[i]], training = training, info = x$term_info)
17. prep.recipe(blueprint$recipe, training = data, fresh = blueprint$fresh)
16. recipes::prep(blueprint$recipe, training = data, fresh = blueprint$fresh)
15. blueprint$mold$process(blueprint = blueprint, data = data)
14. run_mold.recipe_blueprint(blueprint, data)
13. run_mold(blueprint, data)
12. mold.recipe(recipe, data, blueprint = blueprint)
11. hardhat::mold(recipe, data, blueprint = blueprint)
10. fit.action_recipe(action, workflow = workflow, data = data)
9. fit(action, workflow = workflow, data = data)
8. .fit_pre(workflow, data)
7. fit.workflow(., data = titanic_train)
6. fit(., data = titanic_train)
5. is_workflow(x)
4. validate_is_workflow(x)
3. pull_workflow_fit(.)
2. tidy(.)
1. workflow() %>% add_model(log_model) %>% add_recipe(titanic_rec) %>% fit(data = titanic_train) %>% pull_workflow_fit() %>% tidy()

Solution

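  • The posted error comes from step_impute_knn(), where the number of neighbors should be specified with the neighbors argument rather than k. Since k is not an argument of that step, k = 3 is passed along with the variable selectors as a named selection, which recipes reads as an attempt to rename a variable. Secondly, I would advise against using Name as a predictor, since it creates a separate dummy variable for each name, which would mess with the fit.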

    The final error comes from step_interact(). You can't use step_interact(~ Sex:Age) after step_dummy(Sex) because there won't be any column named Sex once step_dummy() is done. Instead there will be a Sex_male column (female is the reference level, so it is absorbed into the intercept). A way to catch all the created dummy variables is to use starts_with() inside step_interact().

    library(tidymodels)
    
    titanic_train <- readr::read_csv("your/path/to/data/train.csv")
    
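    # logistic_reg() is a classification model, so the outcome must be a factor
    titanic_train <- titanic_train %>%
      mutate(Survived = factor(Survived),
             Pclass = factor(Pclass),
             # family size = siblings/spouses + parents/children + the passenger
             Family_Size = SibSp + Parch + 1)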
    
    titanic_rec <- recipe(Survived ~ Sex + Age + Pclass + Embarked + Family_Size, 
                          data = titanic_train) %>%
      step_impute_knn(all_predictors(), neighbors = 3) %>% 
      step_dummy(Sex, Pclass, Embarked) %>% 
      step_interact(~ starts_with("Sex_"):Age + 
                      starts_with("Sex_"):starts_with("Pclass_") + 
                      starts_with("Pclass_"):Age)
      
    log_model <- logistic_reg() %>% 
                  set_engine("glm") %>% 
                  set_mode("classification")
    
    fitted_log_model <- workflow() %>%
                          add_model(log_model) %>%
                          add_recipe(titanic_rec) %>% 
                          fit(data = titanic_train) %>% 
                          pull_workflow_fit() %>% 
                          tidy()
    
    fitted_log_model
    #> # A tibble: 13 x 5
    #>    term                 estimate std.error statistic   p.value
    #>    <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
    #>  1 (Intercept)            3.85      0.921      4.18  0.0000289
    #>  2 Age                    0.0117    0.0226     0.516 0.606    
    #>  3 Family_Size           -0.226     0.0671    -3.36  0.000769 
    #>  4 Sex_male              -2.22      0.886     -2.50  0.0124   
    #>  5 Pclass_X2              1.53      1.16       1.31  0.189    
    #>  6 Pclass_X3             -2.42      0.884     -2.74  0.00615  
    #>  7 Embarked_Q            -0.0461    0.368     -0.125 0.900    
    #>  8 Embarked_S            -0.548     0.243     -2.26  0.0241   
    #>  9 Sex_male_x_Age        -0.0488    0.0199    -2.46  0.0140   
    #> 10 Sex_male_x_Pclass_X2  -1.28      0.879     -1.46  0.144    
    #> 11 Sex_male_x_Pclass_X3   1.48      0.699      2.11  0.0347   
    #> 12 Age_x_Pclass_X2       -0.0708    0.0263    -2.69  0.00714  
    #> 13 Age_x_Pclass_X3       -0.0341    0.0209    -1.63  0.103
    

    Created on 2021-07-01 by the reprex package (v2.0.0)
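
To check both points without the Kaggle file, here is a minimal sketch on a small made-up data frame (the `toy` data and the object names `base_rec`, `k_result`, `good_rec` and `baked` are assumptions for illustration, not part of the original post). It first captures whatever error `k = 3` produces, then preps a recipe with `neighbors = 3` and prints the column names that exist after `step_dummy()`, which are the names `step_interact()` has to match. The exact wording of the error, and the point at which it is raised, may differ between recipes versions.

```r
library(recipes)

# Toy stand-in for the Titanic columns; the values are invented for illustration.
toy <- data.frame(
  Survived = factor(c(0, 1, 1, 0, 1, 0, 0, 1)),
  Sex      = factor(c("male", "female", "female", "male",
                      "female", "male", "male", "female")),
  Pclass   = factor(c(3, 1, 3, 1, 2, 3, 2, 1)),
  Age      = c(22, 38, NA, 35, 27, NA, 54, 19)
)

base_rec <- recipe(Survived ~ Sex + Pclass + Age, data = toy)

# `k` is not an argument of step_impute_knn(), so `k = 3` lands in `...`
# as a named selector. Capture the resulting error instead of stopping.
k_result <- tryCatch({
  prep(step_impute_knn(base_rec, all_predictors(), k = 3), training = toy)
  "no error"
}, error = function(e) conditionMessage(e))
print(k_result)

# With `neighbors = 3` the recipe preps, and the dummy columns can be inspected.
good_rec <- base_rec %>%
  step_impute_knn(all_predictors(), neighbors = 3) %>%
  step_dummy(Sex, Pclass)

baked <- bake(prep(good_rec, training = toy), new_data = NULL)
print(names(baked))
```

Printing `names(baked)` before writing `step_interact()` is a cheap way to see which prefixes to pass to `starts_with()`.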