Sparklyr handling categorical variables

I come from an R background and am used to categorical variables being handled behind the scenes (as factors). With sparklyr, I find string_indexer and onehotencoder quite confusing to use.

For example, I have a number of variables that are encoded as numeric in the original dataset but are actually categorical. I want to use them as categorical variables, but I am not sure I am doing it correctly.

library(sparklyr)
library(dplyr)
sessionInfo()
# spark_version is assumed here to hold a version string defined elsewhere
sc <- spark_connect(master = "local", version = spark_version)
spark_version(sc)
set.seed(1)
exampleDF <- data.frame(ID = 1:10,
                        Resp = sample(c(100:205), 10, replace = TRUE),
                        Numb = sample(1:10, 10))

example <- copy_to(sc, exampleDF)
# Cast Resp to character, index it, fit a tree, and predict on the same data
pred <- example %>%
  mutate(Resp = as.character(Resp)) %>%
  sdf_mutate(Resp_cat = ft_string_indexer(Resp)) %>%
  ml_decision_tree(response = "Resp_cat", features = "Numb") %>%
  sdf_predict()
pred

The prediction from the model is not categorical (see the output below). Does this mean I also have to convert prediction back to Resp_cat and then to Resp?

R version 3.4.0 (2017-04-21)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

spark_version(sc)
[1] ‘2.1.1.2.6.1.0’

Source:   table<sparklyr_tmp_74e340c5607c> [?? x 6]
Database: spark_connection
      ID  Numb  Resp Resp_cat id74e35c6b2dbb prediction
   <int> <int> <chr>    <dbl>          <dbl>      <dbl>
 1     1    10   150        8              0   8.000000
 2     2     3   191        4              1   4.000000
 3     3     4   146        9              2   9.000000
 4     4     9   125        5              3   5.000000
 5     5     8   107        2              4   2.000000
 6     6     2   110        1              5   1.000000
 7     7     5   133        3              6   5.333333
 8     8     7   154        6              7   5.333333
 9     9     1   170        0              8   0.000000
10    10     6   143        7              9   5.333333

Solution

  • In general, Spark depends on column metadata when handling categorical data. In your pipeline this is handled by StringIndexer (ft_string_indexer). ML always predicts the indexed labels, not the original strings. Normally you would convert them back with the IndexToString transformer, which sparklyr exposes as ft_index_to_string.

    In Spark, IndexToString can use either a provided list of labels or column metadata. Unfortunately, the sparklyr implementation is limited in two ways:

    • It can use only metadata, which is not set on the prediction column.
    • ft_string_indexer discards the trained model, so it cannot be used to extract the labels.
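
To make the encoding concrete, here is a minimal base R sketch of the idea behind StringIndexer and IndexToString. It does not call Spark or sparklyr. The toy values, the variable names, and the alphabetical tie-breaking are assumptions made for illustration only.

```r
# Toy labels; the values are made up for illustration.
resp <- c("150", "191", "150", "146", "191", "150")

# Order the distinct labels by descending frequency, which is the default
# ordering StringIndexer uses. Ties are broken alphabetically here; that
# tie-breaking rule is an assumption of this sketch.
counts <- table(resp)
labels <- names(counts)[order(-as.integer(counts), names(counts))]

# Encode: zero-based position of each value in `labels`.
resp_cat <- match(resp, labels) - 1

# Decode: look each index up in the label list, which is what
# IndexToString does when it is given an explicit list of labels.
decoded <- labels[resp_cat + 1]
```

Decoding needs nothing more than the ordered label list, which is why losing the trained indexer (or the column metadata) forces a manual mapping.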

    It is possible I missed something, but it looks like you'll have to map predictions manually, for example by joining with the transformed data:

    pred %>% 
      select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
      distinct() %>% 
      right_join(pred)
    
    Joining, by = "prediction"
    # Source:   lazy query [?? x 9]
    # Database: spark_connection
       prediction Resp_prediction    ID  Numb  Resp Resp_cat id777a79821e1e
            <dbl>           <chr> <int> <int> <chr>    <dbl>          <dbl>
     1          7             171     1     3   171        7              0
     2          0             153     2    10   153        0              1
     3          3             132     3     8   132        3              2
     4          5             122     4     7   122        5              3
     5          6             198     5     4   198        6              4
     6          2             164     6     9   164        2              5
     7          4             137     7     6   137        4              6
     8          1             184     8     5   184        1              7
     9          0             153     9     1   153        0              8
    10          1             184    10     2   184        1              9
    # ... with more rows, and 2 more variables: rawPrediction <list>,
    #   probability <list>
    

    Explanation:

    pred %>% 
      select(prediction=Resp_cat, Resp_prediction=Resp) %>% 
      distinct() 
    

    creates a mapping from prediction (the encoded label) to the original label. We rename Resp_cat to prediction so it can serve as the join key, and Resp to Resp_prediction to avoid a conflict with the actual Resp column.

    Finally, we apply a right equijoin, which keeps every row of pred:

    ... %>%  right_join(pred)
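
The same mapping-and-right-join idea can be tried without a Spark connection. The sketch below uses base R's merge on a local data frame instead of dplyr on a Spark table; pred_local and its values are hypothetical stand-ins for pred, not output from the model above.

```r
# Toy stand-in for `pred`; the values are made up for illustration.
pred_local <- data.frame(
  Resp       = c("150", "191", "146", "125"),
  Resp_cat   = c(0, 1, 2, 3),
  prediction = c(1, 1, 2, 2.5),  # 2.5 mimics a regression-style output
  stringsAsFactors = FALSE
)

# Mapping from encoded label to original label, renamed for the join.
mapping <- unique(data.frame(
  prediction      = pred_local$Resp_cat,
  Resp_prediction = pred_local$Resp,
  stringsAsFactors = FALSE
))

# all.y = TRUE keeps every row of pred_local, i.e. a right join.
decoded <- merge(mapping, pred_local, by = "prediction", all.y = TRUE)
decoded[, c("Resp", "Resp_cat", "prediction", "Resp_prediction")]
```

The row whose prediction is 2.5 has no matching label, so its Resp_prediction is NA. Non-integer predictions from a regression tree would be lost in the same way.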
    

    Note:

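    You should specify the type of tree. Resp_cat is numeric, so without an explicit type the tree appears to be fitted as a regression, which would explain fractional predictions such as 5.333333 in the question's output:

    ml_decision_tree(
      response = "Resp_cat", features = "Numb", type = "classification")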