I have two questions:
How can I convert multiple categorical variables into one big matrix of dummy variables in Spark?
How can I get the correct output from the one-hot encoder and run a (logistic) regression on it?
I am stuck on how to use ft_string_indexer and ft_one_hot_encoder to get the right tbl.
As an example, I created the following data frame:
library(sparklyr)
library(tidyverse)
sc <- spark_connect(master = "yarn-client", spark_home = Sys.getenv("SPARK_HOME"), app_name = "sparklyr",
                    version = "2.1.2", hadoop_version = "2.6", config = configs)
df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5))
copy_to(sc, df, "df")
So df currently looks like this:
# Source: table<df> [?? x 3]
# Database: spark_connection
a b y
<chr> <chr> <int>
1 a one 0
2 b two 1
3 c one 1
4 d two 0
5 a one 1
6 b two 0
7 c one 0
8 d two 1
9 a one 0
10 b two 1
# ... with more rows
I run the following sequence of mutations and get this output:
df2 <- tbl(sc, "df")
df2 %>%
  sdf_mutate(a_idx = ft_string_indexer(a)) %>%
  sdf_mutate(b_idx = ft_string_indexer(b)) %>%
  sdf_mutate(a_vec = ft_one_hot_encoder(a_idx)) %>%
  sdf_mutate(b_vec = ft_one_hot_encoder(b_idx)) %>%
  collect()
# A tibble: 20 x 7
a b y a_idx b_idx a_vec b_vec
<chr> <chr> <int> <dbl> <dbl> <list> <list>
1 a one 0 0 0 <dbl [3]> <dbl [1]>
2 b two 1 1 1 <dbl [3]> <dbl [1]>
3 c one 1 2 0 <dbl [3]> <dbl [1]>
4 d two 0 3 1 <dbl [3]> <dbl [1]>
5 a one 1 0 0 <dbl [3]> <dbl [1]>
6 b two 0 1 1 <dbl [3]> <dbl [1]>
7 c one 0 2 0 <dbl [3]> <dbl [1]>
8 d two 1 3 1 <dbl [3]> <dbl [1]>
9 a one 0 0 0 <dbl [3]> <dbl [1]>
10 b two 1 1 1 <dbl [3]> <dbl [1]>
11 c one 1 2 0 <dbl [3]> <dbl [1]>
12 d two 0 3 1 <dbl [3]> <dbl [1]>
13 a one 1 0 0 <dbl [3]> <dbl [1]>
14 b two 0 1 1 <dbl [3]> <dbl [1]>
15 c one 0 2 0 <dbl [3]> <dbl [1]>
16 d two 0 3 1 <dbl [3]> <dbl [1]>
17 a one 0 0 0 <dbl [3]> <dbl [1]>
18 b two 1 1 1 <dbl [3]> <dbl [1]>
19 c one 0 2 0 <dbl [3]> <dbl [1]>
20 d two 0 3 1 <dbl [3]> <dbl [1]>
This output does not look like something I can pass to the ml_logistic_regression function. Any help on how to encode multiple columns, format the result correctly, and run a regression on it would be appreciated!
The logistic regression classifier requires a single features column as input, so you need to build that one column from the encoded a_vec and b_vec. For that you can use the vector assembler like this:
features_idx = ft_vector_assembler(c("a_vec", "b_vec"))
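To show where the assembler fits in the whole flow, here is a minimal end-to-end sketch. It is not the code from the question: it assumes a recent sparklyr release (1.0 or later) in which the ft_* functions are applied directly to the table with explicit column names instead of through sdf_mutate, and it assumes a local Spark installation rather than the YARN cluster used above. The names df_tbl, encoded, model, and model_formula, and the output column name "features", are made up for illustration. In older sparklyr releases ft_one_hot_encoder takes input_col/output_col rather than input_cols/output_cols, so check the signature for your version.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; swap in your own connection settings.
sc <- spark_connect(master = "local")

df <- data.frame(
  a = rep(letters[1:4], 5),
  b = rep(c("one", "two"), 10),
  y = rbinom(n = 20, size = 1, prob = 0.5),
  stringsAsFactors = FALSE
)
df_tbl <- copy_to(sc, df, "df", overwrite = TRUE)

encoded <- df_tbl %>%
  # Map each string category to a numeric index.
  ft_string_indexer(input_col = "a", output_col = "a_idx") %>%
  ft_string_indexer(input_col = "b", output_col = "b_idx") %>%
  # One-hot encode each index; by default the last category is dropped,
  # so 4 levels give a vector of length 3 and 2 levels a vector of length 1.
  ft_one_hot_encoder(input_cols = "a_idx", output_cols = "a_vec") %>%
  ft_one_hot_encoder(input_cols = "b_idx", output_cols = "b_vec") %>%
  # Concatenate the encoded vectors into the single column the classifier expects.
  ft_vector_assembler(input_cols = c("a_vec", "b_vec"), output_col = "features")

# Fit on the assembled column, with y as the label.
model <- encoded %>%
  ml_logistic_regression(features_col = "features", label_col = "y")

# Shortcut: the formula interface indexes and encodes string columns for you.
model_formula <- ml_logistic_regression(df_tbl, y ~ a + b)

spark_disconnect(sc)
```

The vector lengths of 3 and 1 in the question's output are therefore expected: they are the dummy variables with one reference level dropped per column, which is what a regression with an intercept needs. If you do not need the intermediate columns, the formula call at the end is usually the least error-prone route.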