
One-hot encoding for a set of columns in R


I am trying to one-hot encode a subset of df columns in R.

One-hot encoding converts categorical variables into a form that ML algorithms can use for prediction: each distinct string in a column becomes its own binary (0/1) column.
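
To make the idea concrete, here is a minimal base R sketch on a toy column. The `toy` data frame and its `color` column are made up for illustration and are not part of the df below; `model.matrix()` is just one way to produce the indicator columns.

```r
# Toy data (hypothetical, not the df from the question)
toy <- data.frame(color = c("red", "blue", "red"))

# "- 1" drops the intercept so every level gets its own 0/1 column
encoded <- model.matrix(~ color - 1, data = toy)
print(encoded)
```

The result has one column per distinct string (`colorblue` and `colorred`), with a 1 in the rows where that string occurs.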

Suppose we have a df that looks like this:

mes          work_location  birth_place
01/01/2000   China          Chile
01/02/2000   Mexico         Japan
01/03/2000   China          Chile
01/04/2000   China          Argentina
01/05/2000   USA            Poland
01/06/2000   Mexico         Poland
01/07/2000   USA            Finland
01/08/2000   USA            Finland
01/09/2000   Japan          Norway
01/10/2000   Japan          Kenia
01/11/2000   Japan          Mali
01/12/2000   India          Mali

Here's the code to one-hot encode:

## function to hot-encode ##
columna_dummy <- function(df, columna) {
  df %>% 
    mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>% 
    mutate(valor = 1) %>% 
    spread(key = columna, value = valor, fill = 0)
}

## selecting columns ##
columnas <- c("work_location", "birth_place")

## applying loop to repeat columna_dummy function for each df column ##
for (i in 1:length(columnas)) {
  new_dataset <- columna_dummy(df, i)
}

Console output:

Error: Problem with `mutate()` input `mes`.
x object '1' not found
i Input `mes` is `(structure(function (..., .x = ..1, .y = ..2, . = ..1) ...`.
Run `rlang::last_error()` to see where the error occurred.
Called from: signal_abort(cnd) 

The mes column is a date-class column. It is not included in the columnas vector, yet the code still raises the error above.

The expected output should look something like this, with one new column for each string in the selected columns:

(I could not add every single column, but work_location_China is an example of how the new columns should look.)

mes          work_location  birth_place  work_location_China
01/01/2000   China          Chile        1
01/02/2000   Mexico         Japan        0
01/03/2000   China          Chile        1
01/04/2000   China          Argentina    1
01/05/2000   USA            Poland       0
01/06/2000   Mexico         Poland       0
01/07/2000   USA            Finland      0
01/08/2000   USA            Finland      0
01/09/2000   Japan          Norway       0
01/10/2000   Japan          Kenia        0
01/11/2000   Japan          Mali         0
01/12/2000   India          Mali         0

Is there any other way to apply this loop?


Solution

  • I solved the issue by using the purrr library:

    ## data ##
    
    df <- structure(list(mes = c("01/01/2000", "01/02/2000", "01/03/2000", 
    "01/04/2000", "01/05/2000", "01/06/2000", "01/07/2000", "01/08/2000", 
    "01/09/2000", "01/10/2000", "01/11/2000", "01/12/2000"), work_location = c("China", 
    "Mexico", "China", "China", "USA", "Mexico", "USA", "USA", "Japan", 
    "Japan", "Japan", "India"), birth_place = c("Chile", "Japan", 
    "Chile", "Argentina", "Poland", "Poland", "Finland", "Finland", 
    "Norway", "Kenia", "Mali", "Mali")), class = "data.frame", 
    row.names = c(NA, 
    -12L))
    
    ## function to hot-encode ##
    
    columna_dummy <- function(df, columna) {
      df %>% 
        mutate_at(columna, ~paste(columna, eval(as.symbol(columna)), sep = "_")) %>% 
        mutate(valor = 1) %>% 
        spread(key = columna, value = valor, fill = 0)
    }
    
    ## vector of columns ##
    
    columnas <- c("work_location", "birth_place")
    
    
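    ## hot_encoded_dataset ##
    
    library(dplyr)  # %>%, mutate_at(), mutate(), inner_join()
    library(tidyr)  # spread()
    library(purrr)  # map(), reduce()
    
    ## encode each column separately, then join the results on the shared column (mes) ##
    hot_encoded_dataset <- purrr::map(columnas, columna_dummy, df = df) %>% 
      reduce(inner_join)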
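
As for the original loop: it appears to fail because it passes the index `i` (the number 1) instead of the column name `columnas[i]`, so the first column (mes) is selected by position and R then looks for an object called `1`. It also overwrites `new_dataset` on every pass instead of accumulating the new columns. For anyone who prefers to keep a plain loop, here is a minimal base R sketch that avoids both problems. It is not the purrr approach above, the sample data is a shortened copy of the df, and the names `encoded` and `nueva` are made up for illustration.

```r
# Small sample with the same column names as the df in the question
df <- data.frame(
  mes = c("01/01/2000", "01/02/2000", "01/03/2000", "01/04/2000"),
  work_location = c("China", "Mexico", "China", "China"),
  birth_place = c("Chile", "Japan", "Chile", "Argentina"),
  stringsAsFactors = FALSE
)

columnas <- c("work_location", "birth_place")

# Start from the original data so indicator columns accumulate across iterations
encoded <- df
for (columna in columnas) {               # iterate over column names, not positions
  for (valor in unique(df[[columna]])) {  # one new column per distinct string
    nueva <- paste(columna, valor, sep = "_")
    encoded[[nueva]] <- as.integer(df[[columna]] == valor)
  }
}

print(encoded)
```

This keeps the original mes, work_location and birth_place columns and appends one 0/1 column per distinct string, which matches the layout of the expected output in the question.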