How to correctly use ft_string_indexer and ft_one_hot_encoder for multiple columns in sparklyr


I have two questions:

  1. How can I convert multiple categorical variables into one big matrix of dummy variables in Spark?

  2. How can I get the correct output from ft_one_hot_encoder and run a (logistic) regression on it?

I am stuck on how to use the ft_string_indexer and ft_one_hot_encoder to get the right tbl.

As an example, I created the following data frame:

library(sparklyr)
library(tidyverse)

sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"),
                    app_name = "sparklyr", version = "2.1.2",
                    hadoop_version = "2.6", config = configs)

df <- data.frame(
  a=rep(letters[1:4],5), 
  b=rep(c("one", "two"), 10), 
  y=rbinom(n=20,size=1,prob=0.5))

copy_to(sc, df, "df")

So df currently looks like this:

# Source:   table<df> [?? x 3]
# Database: spark_connection
   a     b         y
   <chr> <chr> <int>
 1 a     one       0
 2 b     two       1
 3 c     one       1
 4 d     two       0
 5 a     one       1
 6 b     two       0
 7 c     one       0
 8 d     two       1
 9 a     one       0
10 b     two       1
# ... with more rows

I then run the following sequence of transformations and get this output:

df2 <- tbl(sc, "df")
df2 %>% 
    sdf_mutate(a_idx = ft_string_indexer(a)) %>% 
    sdf_mutate(b_idx = ft_string_indexer(b)) %>% 
    sdf_mutate(a_vec = ft_one_hot_encoder(a_idx)) %>% 
    sdf_mutate(b_vec = ft_one_hot_encoder(b_idx)) %>% 
    collect()

# A tibble: 20 x 7
   a     b         y a_idx b_idx a_vec     b_vec    
   <chr> <chr> <int> <dbl> <dbl> <list>    <list>   
 1 a     one       0     0     0 <dbl [3]> <dbl [1]>
 2 b     two       1     1     1 <dbl [3]> <dbl [1]>
 3 c     one       1     2     0 <dbl [3]> <dbl [1]>
 4 d     two       0     3     1 <dbl [3]> <dbl [1]>
 5 a     one       1     0     0 <dbl [3]> <dbl [1]>
 6 b     two       0     1     1 <dbl [3]> <dbl [1]>
 7 c     one       0     2     0 <dbl [3]> <dbl [1]>
 8 d     two       1     3     1 <dbl [3]> <dbl [1]>
 9 a     one       0     0     0 <dbl [3]> <dbl [1]>
10 b     two       1     1     1 <dbl [3]> <dbl [1]>
11 c     one       1     2     0 <dbl [3]> <dbl [1]>
12 d     two       0     3     1 <dbl [3]> <dbl [1]>
13 a     one       1     0     0 <dbl [3]> <dbl [1]>
14 b     two       0     1     1 <dbl [3]> <dbl [1]>
15 c     one       0     2     0 <dbl [3]> <dbl [1]>
16 d     two       0     3     1 <dbl [3]> <dbl [1]>
17 a     one       0     0     0 <dbl [3]> <dbl [1]>
18 b     two       1     1     1 <dbl [3]> <dbl [1]>
19 c     one       0     2     0 <dbl [3]> <dbl [1]>
20 d     two       0     3     1 <dbl [3]> <dbl [1]>

This output does not look like something I can pass to the ml_logistic_regression function. Any help with encoding multiple columns correctly and running a regression on the result would be appreciated!


Solution

  • The logistic regression classifier expects a single features column as input, so you need to combine the encoded a_vec and b_vec columns into one vector column. The vector assembler does this:

    features_idx = ft_vector_assembler(c("a_vec", "b_vec"))
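
    Putting the steps together, a minimal sketch of the whole flow might look like the following. It is not necessarily what the asker's sparklyr version accepts: it assumes a more recent sparklyr in which the ft_* functions take a Spark table plus input and output column names (passed positionally here, because the argument names of ft_one_hot_encoder differ between releases), and a local Spark installation instead of the YARN cluster from the question. The names df_tbl, encoded, fit and fit_formula are illustrative.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; the question connects to YARN instead.
sc <- spark_connect(master = "local")

# Same toy data as in the question.
df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5)
)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

encoded <- df_tbl %>%
  # Cast the label to double so it is safe as a Spark ML label column.
  mutate(y = as.double(y)) %>%
  # Map each string category to a numeric index.
  ft_string_indexer("a", "a_idx") %>%
  ft_string_indexer("b", "b_idx") %>%
  # Turn each index into a sparse dummy vector.
  ft_one_hot_encoder("a_idx", "a_vec") %>%
  ft_one_hot_encoder("b_idx", "b_vec") %>%
  # Concatenate the dummy vectors into the single column the classifier expects.
  ft_vector_assembler(c("a_vec", "b_vec"), "features")

# Fit on the assembled features column.
fit <- ml_logistic_regression(encoded, features_col = "features", label_col = "y")

# Alternative: the formula interface indexes and encodes string columns itself.
fit_formula <- ml_logistic_regression(df_tbl, y ~ a + b)
```

    The three-element a_vec and one-element b_vec in the question are expected: Spark's one-hot encoder drops the last category by default, so four levels become three dummies and two levels become one. If all you need is the fitted model, the formula call at the end is usually the shorter route.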