Search code examples
rsparklyr

How to use sdf_pivot() in sparklyr and concatenate strings?


I am trying to use the sdf_pivot() function in sparklyr to "gather" a long format data frame into a wide format. The values of the variables are strings that I would like to concatenate.

Here is a simple example that I think should work but doesn't:

library(sparkylr)
d <- data.frame(id=c("1", "1", "2", "2", "1", "2"), 
                 x=c("200", "200", "200", "201", "201", "201"), 
                 y=c("This", "That", "The", "Other", "End", "End"))
d_sdf <- copy_to(sc, d, "d")
sdf_pivot(d_sdf, id ~ x, paste)

What I'd like it to produce is:

| id | `200`       | `201`           |
|====|=============|=================|
| 1  | "This That" | "End"           |
| 2  | "The"       | "Other End"     |

Unfortunately this gives me an error stating:

Error in as.vector(x, "character") : 
  cannot coerce type 'environment' to vector of type 'character'

I also tried using "collect_list" and that gives me this error:

Error: java.lang.IllegalArgumentException: invalid method collect_list 
 for object 641

Is there a way to do what I'm attempting to do?


Solution

  • i dug into the tests for sdf_pivot and it seems you can use invoke inside a custom fun.aggregate function to access the collect_list function:

     fun.aggregate <- function(gdf) {
    
      expr <- invoke_static(
        sc,
        "org.apache.spark.sql.functions",
        "expr",
        "collect_list(y)" #this is your own "y" variable
      )
    
      gdf %>% invoke("agg", expr, list())
    }
    

    that you can then use in sdf_pivot:

    d_sdf_wide <- sdf_pivot(d_sdf, id ~ x, fun.aggregate)
    

    this does do the job:

    > d_sdf_wide
    Source:     table<sparklyr_tmp_69c14424c5a4> [?? x 3]
    Database:   spark connection master=local[8] app=sparklyr local=TRUE
    
         id      `200`      `201`
      <chr>     <list>     <list>
    1     1 <list [2]> <list [1]>
    2     2 <list [1]> <list [2]>
    

    (your data is now in list format, not a string, but you can concatenate the lists if you like, e.g.

    d_sdf_wide %>% mutate(liststring = paste(`200`))
    
         id      `200`      `201` liststring
      <chr>     <list>     <list>      <chr>
    1     1 <list [2]> <list [1]>  This That
    2     2 <list [1]> <list [2]>        The
    

    (alternatively, you could write a complicated sql query, but i haven't tried)