I have a sample data frame df in R and a Spark data frame rd_3 in sparklyr. I want to create a visit_category column in the Spark data frame. I know we can use the cut function in R to create such a column, but how do I do the same in sparklyr?
For reproducibility:
df <- data.frame(
  visit_duration = c(12, 20, 70, 100),
  city = c("X", "X", "X", "X"),
  visit_category = c("0-15", "15-25", "25-80", "80-120")
)
rd_3 <- copy_to(sc, df)
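For reference, this is roughly what I would do in plain R with cut, and what I want to reproduce on the Spark side:

# Base-R equivalent of the binning I am after (works on a local data frame only)
df$visit_category <- cut(
  df$visit_duration,
  breaks = c(0, 15, 25, 80, 120),
  labels = c("0-15", "15-25", "25-80", "80-120"),
  include.lowest = TRUE
)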
I cannot use ifelse statements, as the number of bins is more than 50. I used ft_bucketizer in sparklyr, but it throws the error shown below:
rd_3 %>%
  ft_bucketizer("visit_duration", "Visit_Category",
                splits = c(0, 15, 25, 80, 120)) %>%
  mutate(Visit_Category = factor(Visit_Category,
                                 labels = c("0-15", "15-25", "25-80", "80-120")))
This is the error I get:
Error: org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'AS' expecting {')', ','}(line 1, pos 98)
== SQL ==
SELECT `new_col`, `visit_duration`, FACTOR(`Visit_Category`, ("0-15",
"15-25", "25-80", "80-120") AS "labels") AS `Visit_Category`
In addition: Warning message:
Named arguments ignored for SQL FACTOR
There are no factors or equivalent types in Spark SQL. Instead, where needed, Spark ML transformers add special column metadata. As a result, the factor call is interpreted as a remote function and passed through the SQL translation engine, rendering complete gibberish.
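You can see the passthrough for yourself with show_query, which prints the generated SQL without executing it (output approximate):

# Any R function dbplyr cannot translate is passed to Spark verbatim,
# which is how factor(...) ends up as an invalid FACTOR(...) SQL call
rd_3 %>%
  mutate(x = factor(city)) %>%
  show_query()
# <SQL> SELECT `visit_duration`, `city`, `visit_category`, FACTOR(`city`) AS `x`
# FROM `df`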
Now, assuming that you really want to go with the bucketizer, you'll have to bucketize first:
splits <- c(0, 15, 25, 80, 120)

bucketized <- rd_3 %>%
  ft_bucketizer("visit_duration", "Visit_Category", splits = splits)
then create a reference table:
ref <- copy_to(sc, tibble(
  Visit_Category = seq_along(splits[-1]) - 1,  # Bucketizer indices start at 0
  label = paste0(splits[-length(splits)], "-", splits[-1])
))
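For these splits the lookup data is just four index-label pairs (shown locally, before copy_to):

tibble(
  Visit_Category = seq_along(splits[-1]) - 1,
  label = paste0(splits[-length(splits)], "-", splits[-1])
)
# Visit_Category: 0, 1, 2, 3
# label:          "0-15", "15-25", "25-80", "80-120"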
and join:
bucketized %>% left_join(ref, by = "Visit_Category")
# Source: spark<?> [?? x 4]
  visit_duration city  Visit_Category label
           <dbl> <chr>          <dbl> <chr>
1             12 X                  0 0-15
2             20 X                  1 15-25
3             70 X                  2 25-80
4            100 X                  3 80-120
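If you don't want to keep the numeric bucket index around, a minimal follow-up sketch:

bucketized %>%
  left_join(ref, by = "Visit_Category") %>%
  select(-Visit_Category) %>%        # drop the numeric index
  rename(Visit_Category = label)     # reuse the name for the labels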
Though it might be easier to just construct a CASE WHEN expression like this one:
library(rlang)
expr <- purrr::map2(
splits[-length(splits)], splits[-1],
function(lo, hi)
glue::glue("visit_duration %BETWEEN% {lo} %AND% {hi} ~ '{lo}-{hi}'")
) %>%
glue::glue_collapse(sep=",\n") %>%
paste("case_when(\n", ., ")")
rd_3 %>% mutate(result = !!parse_quo(expr, env = caller_frame()))
# Source: spark<?> [?? x 4]
  visit_duration city  visit_category result
           <dbl> <chr> <chr>          <chr>
1             12 X     0-15           0-15
2             20 X     15-25          15-25
3             70 X     25-80          25-80
4            100 X     80-120         80-120
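The %BETWEEN% and %AND% infixes work because dbplyr translates any %op% infix operator into the bare SQL operator op, so each branch renders as visit_duration BETWEEN lo AND hi on the Spark side. You can inspect the generated expression before splicing it in:

cat(expr)
# case_when(
#  visit_duration %BETWEEN% 0 %AND% 15 ~ '0-15',
#  visit_duration %BETWEEN% 15 %AND% 25 ~ '15-25',
#  visit_duration %BETWEEN% 25 %AND% 80 ~ '25-80',
#  visit_duration %BETWEEN% 80 %AND% 120 ~ '80-120' )

One caveat: SQL BETWEEN is inclusive on both ends, so a value sitting exactly on an interior boundary (say 15) matches two branches; case_when simply keeps the first match.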
or simply take a Cartesian product with the reference and filter the results (cross_join requires dplyr >= 1.1.0):
ref2 <- copy_to(sc, tibble(
lo = splits[-length(splits)],
hi = splits[-1]
))
cross_join(rd_3, ref2) %>%
  filter(visit_duration >= lo & visit_duration < hi) %>%
  mutate(label = paste0(lo, "-", hi)) %>%
  select(-lo, -hi)
# Source: spark<?> [?? x 4]
  visit_duration city  visit_category label
           <dbl> <chr> <chr>          <chr>
1             12 X     0-15           0.0-15.0
2             20 X     15-25          15.0-25.0
3             70 X     25-80          25.0-80.0
4            100 X     80-120         80.0-120.0
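Since lo and hi are stored as doubles in Spark, CONCAT renders them with a decimal point (0.0-15.0 and so on). If you want the same 0-15 style labels as before, one option, assuming integer-valued splits, is to cast before concatenating:

cross_join(rd_3, ref2) %>%
  filter(visit_duration >= lo & visit_duration < hi) %>%
  mutate(label = paste0(as.integer(lo), "-", as.integer(hi))) %>%  # CAST(... AS INT)
  select(-lo, -hi)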