Tags: r, apache-spark, dplyr, apache-spark-sql, sparklyr

Can dplyr modify multiple columns of spark DF using a vector?


I'm new to working with Spark. I would like to multiply a large number of columns of a Spark dataframe by values in a vector. So far, with mtcars, I have used a for loop and mutate as follows:

library(dplyr)
library(rlang)
library(sparklyr)

sc1 <- spark_connect(master = "local")

mtcars_sp <- sdf_copy_to(sc1, mtcars, overwrite = TRUE)

mtcars_cols <- colnames(mtcars_sp)
mtc_factors <- 0:10 / 10

# mutate one column at a time
for (i in seq_along(mtcars_cols)) {
    # build the expression - sym() converts a column-name string to a symbol
    mtcars_eq <- quo(!!sym(mtcars_cols[i]) * mtc_factors[i])
    # mutate formula - LHS resolves to a string, RHS is a quosure
    mtcars_sp <- mtcars_sp %>% 
        mutate(!!mtcars_cols[i] := !!mtcars_eq)
}

dbplyr::sql_render(mtcars_sp)
mtcars_sp

This works OK with mtcars. However, as sql_render shows, it sends nested SQL queries to Spark, and it breaks down once there are many columns. Can dplyr instead be used to send a single SQL query in this case?

BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!


Solution

  • In general you can use the approach from the great answer by Artem Sokolov:

    library(glue)
    library(rlang)
    
    # build one named expression string per column, then splice them all into a
    # single mutate() - parse_expr() replaces the deprecated parse_quosure()
    mtcars_sp %>% 
      mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
               lapply(parse_expr))
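
    As a quick check, rendering the SQL for the result should now show a single
    flat SELECT rather than a stack of nested subqueries (scaled_sp is just an
    illustrative name):

    scaled_sp <- mtcars_sp %>% 
      mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
               lapply(parse_expr))
    
    # one mutate() builds one SELECT over the source table
    dbplyr::sql_render(scaled_sp)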
    

    However, if this is input for MLlib algorithms, then ft_vector_assembler combined with ft_elementwise_product might be a better fit:

    scaled <- mtcars_sp %>% 
      ft_vector_assembler(mtcars_cols, "features") %>% 
      ft_elementwise_product("features", "features_scaled", mtc_factors)
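
    If the data is ultimately fed to an MLlib estimator, the assembled vector
    column can be consumed directly. A minimal sketch, assuming k-means as the
    downstream step (k = 3 is arbitrary here):

    model <- scaled %>% 
      ml_kmeans(k = 3, features_col = "features_scaled")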
    

    The result can be separated into individual columns with sdf_separate_column (though I wouldn't recommend that if you're going on to MLlib):

    scaled %>% 
      select(features_scaled) %>% 
      sdf_separate_column("features_scaled", mtcars_cols)
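
    For a quick local sanity check you can pull a few rows back into R (fine
    for mtcars, but avoid collect() on large tables):

    scaled %>% 
      select(features_scaled) %>% 
      sdf_separate_column("features_scaled", mtcars_cols) %>% 
      head(5) %>% 
      collect()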