Tags: r, apache-spark, dplyr, apache-spark-sql, sparklyr

Can dplyr modify multiple columns of spark DF using a vector?


I'm new to working with Spark. I would like to multiply a large number of columns of a Spark dataframe by values in a vector. So far, with mtcars, I have used a for loop and mutate as follows:

library(dplyr)
library(rlang)
library(sparklyr)

sc1 <- spark_connect(master = "local")

mtcars_sp <- sdf_copy_to(sc1, mtcars, overwrite = TRUE)

mtcars_cols <- colnames(mtcars_sp)
mtc_factors <- 0:10 / 10

# mutate one column at a time
for (i in seq_along(mtcars_cols)) {
    # build the expression - sym() converts a column-name string to a symbol
    mtcars_eq <- quo(!!sym(mtcars_cols[i]) * mtc_factors[i])
    # mutate formula - LHS resolves to a string, RHS is a quosure
    mtcars_sp <- mtcars_sp %>% 
        mutate(!!mtcars_cols[i] := !!mtcars_eq)
}

dbplyr::sql_render(mtcars_sp)
mtcars_sp

This works OK with mtcars. However, as sql_render shows, it sends nested SQL queries to Spark, and it breaks down once there are many columns. Can dplyr instead be used to send a single SQL query in this case?

BTW, I'd rather not transpose the data as it would be too expensive. Any help would be much appreciated!


Solution

  • In general you can use the approach from the great answer by Artem Sokolov:

    library(glue)
    library(rlang)
    
    # build one named expression string per column, then splice them all into a
    # single mutate() - parse_expr() replaces the deprecated parse_quosure()
    mtcars_sp %>% 
      mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
               lapply(parse_expr))
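
    As a quick check, rendering the SQL for the result should now show a single
    flat SELECT rather than a stack of nested subqueries (scaled_sp is just an
    illustrative name):

    scaled_sp <- mtcars_sp %>% 
      mutate(!!! setNames(glue("{mtcars_cols} * {mtc_factors}"), mtcars_cols) %>% 
               lapply(parse_expr))
    
    # one mutate() builds one SELECT over the source table
    dbplyr::sql_render(scaled_sp)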
    

    However, if this is input for MLlib algorithms, then ft_vector_assembler combined with ft_elementwise_product might be a better fit:

    scaled <- mtcars_sp %>% 
      ft_vector_assembler(mtcars_cols, "features") %>% 
      ft_elementwise_product("features", "features_scaled", mtc_factors)
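
    If the data is ultimately fed to an MLlib estimator, the assembled vector
    column can be consumed directly. A minimal sketch, assuming k-means as the
    downstream step (k = 3 is arbitrary here):

    model <- scaled %>% 
      ml_kmeans(k = 3, features_col = "features_scaled")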
    

    The result can be separated into individual columns with sdf_separate_column (though I wouldn't recommend that if you're going on to MLlib):

    scaled %>% 
      select(features_scaled) %>% 
      sdf_separate_column("features_scaled", mtcars_cols)
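
    For a quick local sanity check you can pull a few rows back into R (fine
    for mtcars, but avoid collect() on large tables):

    scaled %>% 
      select(features_scaled) %>% 
      sdf_separate_column("features_scaled", mtcars_cols) %>% 
      head(5) %>% 
      collect()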