I want to calculate grouped percentiles using SparkR. I tried this:
library(SparkR)
mtcars_spark %>%
  SparkR::groupBy("cyl") %>%
  SparkR::summarize(p75 = approxQuantile("mpg", 0.75, 0.01),
                    p90 = approxQuantile("mpg", 0.90, 0.01),
                    p99 = approxQuantile("mpg", 0.99, 0.01))
...but got this error:
unable to find an inherited method for function ‘approxQuantile’ for signature ‘"GroupedData", "character", "numeric", "numeric"’
How can I get grouped percentiles using SparkR so that the output matches that of the following code?
library(dplyr)
mtcars %>%
  group_by(cyl) %>%
  summarise(p75 = quantile(mpg, 0.75),
            p90 = quantile(mpg, 0.90),
            p99 = quantile(mpg, 0.99))
approxQuantile is a method which operates on Datasets; it has no variant that works on a *GroupedDataset, which is why the call fails on the GroupedData object returned by groupBy. If you've enabled Hive support, you can use Hive's percentile UDF:
mtcars_spark %>%
  SparkR::groupBy("cyl") %>%
  SparkR::summarize(p75 = expr("percentile(mpg, 0.75)"),
                    p90 = expr("percentile(mpg, 0.90)"),
                    p99 = expr("percentile(mpg, 0.99)"))
If not, you could try the gapply function, but it is likely to be much less efficient.
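A minimal sketch of the gapply route, in case it helps. It assumes a Spark installation that SparkR can start a session against, and that mtcars_spark was created from the built-in mtcars data set with createDataFrame (the question does not show how it was built). The schema column types are likewise an assumption based on mtcars, where cyl and mpg are both numeric.

```r
library(SparkR)

# Assumption: a Spark installation is available for SparkR to start a session.
sparkR.session()

# Assumption: mtcars_spark is the built-in mtcars data set as a SparkDataFrame.
mtcars_spark <- createDataFrame(mtcars)

# gapply needs the schema of the data.frame returned for each group.
schema <- structType(
  structField("cyl", "double"),
  structField("p75", "double"),
  structField("p90", "double"),
  structField("p99", "double")
)

# For each value of cyl, the function receives the grouping key and the
# group's rows as a local R data.frame, so base R's quantile() can be used.
result <- gapply(
  mtcars_spark,
  "cyl",
  function(key, x) {
    data.frame(
      cyl = key[[1]],
      p75 = stats::quantile(x$mpg, 0.75, names = FALSE),
      p90 = stats::quantile(x$mpg, 0.90, names = FALSE),
      p99 = stats::quantile(x$mpg, 0.99, names = FALSE)
    )
  },
  schema
)

head(result)
```

Every group is serialized to an R worker process and has to fit in that worker's memory, which is the main reason this tends to be slower than an aggregate expression evaluated inside Spark.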