r apache-spark sparklyr

What aggregation functions can be used with sdf_pivot in sparklyr?


I'm trying to use sdf_pivot with the development version of sparklyr. The only aggregation function that seems to work is count. If I try sum or avg, I get an exception stating: No matched method found for class org.apache.spark.sql.RelationalGroupedDataset.sum

Here is some code to reproduce:

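library(sparklyr)
library(dplyr)
sc <- spark_connect(master = "local") # assumption: a local Spark connection
iris_tbl <- copy_to(sc, iris)
iris_tbl %>% sdf_pivot(Species ~ Sepal_Width) # this works
iris_tbl %>% sdf_pivot(Species ~ Sepal_Width, "sum") # this doesn't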

Solution

  • I believe this is still undocumented, but you are getting this error because sdf_pivot needs the aggregation to be passed as an R list or an R function.

    Here are some examples:

    Using R list:

    > iris_tbl %>% sdf_pivot(Species ~ Sepal_Width, list(Sepal_Width="sum")) %>% head()
    # Source:   lazy query [?? x 24]
    # Database: spark_connection
         Species `2.0` `2.2` `2.3` `2.4` `2.5` `2.6` `2.7` `2.8` `2.9` `3.0` `3.1` `3.2` `3.3`
           <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 versicolor     2   4.4   6.9   7.2    10   7.8  13.5  16.8  20.3    24   9.3   9.6   3.3
    2  virginica   NaN   2.2   NaN   NaN    10   5.2  10.8  22.4   5.8    36  12.4  16.0   9.9
    3     setosa   NaN   NaN   2.3   NaN   NaN   NaN   NaN   NaN   2.9    18  12.4  16.0   6.6
    # ... with 10 more variables: `3.4` <dbl>, `3.5` <dbl>, `3.6` <dbl>, `3.7` <dbl>,
    #   `3.8` <dbl>, `3.9` <dbl>, `4.0` <dbl>, `4.1` <dbl>, `4.2` <dbl>, `4.4` <dbl>
    

    Using R function:

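    > sum_sepal_width <- function(gdf) {
        # Build a Spark Column from the SQL expression sum(Sepal_Width)
        expr <- invoke_static(
          sc,
          "org.apache.spark.sql.functions",
          "expr",
          "sum(Sepal_Width)"
        )
    
        # Apply the aggregation to the grouped (pivoted) dataset
        gdf %>% invoke("agg", expr, list())
      }
    
    > iris_tbl %>% sdf_pivot(Species ~ Sepal_Width, fun.aggregate = sum_sepal_width)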
    # Source:   table<sparklyr_tmp_4ee61c86311c> [?? x 24]
    # Database: spark_connection
         Species `2.0` `2.2` `2.3` `2.4` `2.5` `2.6` `2.7` `2.8` `2.9` `3.0` `3.1` `3.2` `3.3`
           <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 versicolor     2   4.4   6.9   7.2    10   7.8  13.5  16.8  20.3    24   9.3   9.6   3.3
    2  virginica   NaN   2.2   NaN   NaN    10   5.2  10.8  22.4   5.8    36  12.4  16.0   9.9
    3     setosa   NaN   NaN   2.3   NaN   NaN   NaN   NaN   NaN   2.9    18  12.4  16.0   6.6
    # ... with 10 more variables: `3.4` <dbl>, `3.5` <dbl>, `3.6` <dbl>, `3.7` <dbl>,
    #   `3.8` <dbl>, `3.9` <dbl>, `4.0` <dbl>, `4.1` <dbl>, `4.2` <dbl>, `4.4` <dbl>
    

    Note: sdf_pivot is not available in sparklyr versions before 0.6.0, which was unreleased at the time of writing.