
Why does tidymodels change values during data prep in R?


Assume I have this data set. Please let me know if this is a duplicate, but I am confused by this behavior.

library(tidymodels)

mt <- mtcars[,c('mpg', 'hp', 'drat', 'am')]

mt$hp <- as.character(mt$hp)
mt$drat <- as.character(mt$drat)

dp_pipe1 <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  update_role(c(hp, drat), new_role = "to_numeric") %>%
  step_mutate_at(has_role("to_numeric"), fn = as.numeric)

dp_pipe2 <- prep(dp_pipe1)
bake(dp_pipe2, new_data = NULL)

If you run the final bake() step, you will see that the values of drat have changed: in the original data they were 3.9, 3.9, 3.85, etc., but now they come out as 16, 16, 15, etc. Note that I am forcing a character conversion on the mtcars data only to show that I am doing a character-to-numeric conversion while processing the data.

I am sorry if I have misread the documentation, but I am unable to understand this. Please help.

Note that my data has no factors:

EDIT 2:

> glimpse(mt)
Rows: 32
Columns: 4
$ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3…
$ hp   <chr> "110", "110", "93", "110", "175", "105",…
$ drat <chr> "3.9", "3.9", "3.85", "3.08", "3.15", "2…
$ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

If I run this:

dp_pipe1 <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  update_role(c(hp, drat), new_role = "to_numeric") %>%
  step_mutate_at(has_role("to_numeric"), fn = function(x) as.numeric(as.character(x)))

dp_pipe2 <- prep(dp_pipe1)
bake(dp_pipe2, new_data = NULL)

This code gives the correct result.

EDIT 1:

I am not sure whether this is a bug, but if we use fn = function(x) as.numeric(as.character(x)) in step_mutate_at, it works fine.


Solution

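  • For 99% of modeling situations, factor encodings are better than character encodings for qualitative data. For that reason, recipes converts character columns to factors by default when the recipe is prepped. There is a prep() option (strings_as_factors = FALSE) to avoid this.

    What you are getting for drat is the integer that is the factor level index: by the time step_mutate_at() runs, drat is already a factor, and as.numeric() on a factor returns the level codes rather than the labels. That is also why as.numeric(as.character(x)) works, since it converts the factor back to its labels first.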

    Here's an example:

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    
    drat_0 <- mtcars$drat
    drat_1 <- as.character(drat_0)
    drat_2 <- factor(drat_1)
    drat_3 <- as.numeric(drat_2)
    
    tibble(drat_0, drat_1, drat_2, drat_3) %>% str()
    #> tibble [32 × 4] (S3: tbl_df/tbl/data.frame)
    #>  $ drat_0: num [1:32] 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
    #>  $ drat_1: chr [1:32] "3.9" "3.9" "3.85" "3.08" ...
    #>  $ drat_2: Factor w/ 22 levels "2.76","2.93",..: 16 16 15 5 6 1 7 11 17 17 ...
    #>  $ drat_3: num [1:32] 16 16 15 5 6 1 7 11 17 17 ...
    

    Created on 2023-07-18 with reprex v2.0.2
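
To make the strings_as_factors option mentioned above concrete, here is a minimal sketch using the question's data. It assumes a recipes version where prep() still accepts strings_as_factors (newer releases may prefer setting it in recipe() instead, so check the documentation for your installed version). The names rec and baked are only illustrative.

```r
library(tidymodels)

# Same setup as in the question: hp and drat stored as character
mt <- mtcars[, c("mpg", "hp", "drat", "am")]
mt$hp <- as.character(mt$hp)
mt$drat <- as.character(mt$drat)

rec <- recipe(mpg ~ hp + drat + am, data = mt) %>%
  step_mutate_at(hp, drat, fn = as.numeric)

# Assumption: prep() accepts strings_as_factors in the installed recipes version.
# With FALSE, hp and drat stay character until step_mutate_at() converts them.
baked <- bake(prep(rec, strings_as_factors = FALSE), new_data = NULL)

head(baked$drat)
```

With the conversion to factors turned off, drat should come back with its original values (3.9, 3.9, 3.85, ...) rather than the factor level indices, without needing the as.character() workaround.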