r apache-spark apache-spark-sql apache-spark-ml sparklyr

How to create bins in sparklyr?

I have sample dataframe df in R and rd_3 in sparklyr. I want to create visit_category column in spark dataframe . I know we can use Cut function in R to create same column , but how do I create same in sparklyr ?

For reproducible purpose

df<-data.frame(visit_duration=c(12,20,70,100),city=c("X","X","X","X"),visit_category=c("0-15","15-25","25-80","80-120"))

rd_3<-copy_to(sc,df)

I cannot use ifelse statements as number of bins is more than 50 . I used ft_bucketlizer in sparklyr ,but it showed an error as given below

rd_3 %>%
ft_bucketizer("visit_duration", "Visit_Category", splits = c(0, 15, 25, 80 , 120)) %>% 
mutate(Visit_Category = factor(Visit_Category, labels = c("0-15","15-25","25-80","80-120")))

this is the error I get

Error: org.apache.spark.sql.catalyst.parser.ParseException: 
extraneous input 'AS' expecting {')', ','}(line 1, pos 98)

== SQL ==
SELECT `new_col`, `visit_duration`, FACTOR(`Visit_Category`, ("0-15", 
"15-25", "25-80", "80-120") AS "labels") AS `Visit_Category`

In addition: Warning message:
Named arguments ignored for SQL FACTOR

Solution

There are no factors or equivalent types in Spark SQL. Instead, if needed, Spark ML transformers add special column metadata.

As the result factor call is interpreted as a remote functions and passed through SQL translation engine, rendering complete gibberish.

Now, assuming that you really want to go with bucketizer you'll have to bucketize

splits <- c(0, 15, 25, 80, 120)

bucketized <- rd_3 %>%
   ft_bucketizer("visit_duration", "Visit_Category", splits = splits)

create a reference table:

ref <- copy_to(sc, tibble(
  Visit_Category = seq_along(splits[-1]) - 1,
  label = paste0(
    splits[-length(splits)],
    "-",
    splits[-1]
  )
))

and join:

bucketized %>% left_join(ref, by = "Visit_Category")

# Source: spark<?> [?? x 4]
  visit_duration city  Visit_Category label 
           <dbl> <chr>          <dbl> <chr> 
1             12 X                  0 0-15  
2             20 X                  1 15-25 
3             70 X                  2 25-80 
4            100 X                  3 80-120

Though it might be easier to just construct CASE WHEN expression like this one:

library(rlang)

expr <- purrr::map2(
  splits[-length(splits)], splits[-1], 
 function(lo, hi) 
   glue::glue("visit_duration %BETWEEN% {lo} %AND% {hi} ~ '{lo}-{hi}'")
) %>%
  glue::glue_collapse(sep=",\n") %>% 
  paste("case_when(\n", ., ")")

rd_3 %>% mutate(result = !!parse_quo(expr, env = caller_frame()))

# Source: spark<?> [?? x 4]
  visit_duration city  visit_category result
           <dbl> <chr> <chr>          <chr> 
1             12 X     0-15           0-15  
2             20 X     15-25          15-25 
3             70 X     25-80          25-80 
4            100 X     80-120         80-120

or simply take Cartesian product with reference and filter the results:

ref2 <- copy_to(sc, tibble(
  lo = splits[-length(splits)],
  hi = splits[-1]
))

cross_join(rd_3, ref2, explicit=TRUE) %>% 
  filter(visit_duration >= lo & visit_duration < hi) %>%
  mutate(label = paste0(lo, "-", hi)) %>%
  select(-lo, -hi)

# Source: spark<?> [?? x 6]
  visit_duration city  visit_category    lo    hi label     
           <dbl> <chr> <chr>          <dbl> <dbl> <chr>     
1             12 X     0-15               0    15 0.0-15.0  
2             20 X     15-25             15    25 15.0-25.0 
3             70 X     25-80             25    80 25.0-80.0 
4            100 X     80-120            80   120 80.0-120.0