I am trying to get summary statistics, including mean and median, for data in R that I pulled using sparklyr. I can get all of my stats by typing everything out manually in a dplyr::summarize() step, but I would like to know whether there is a way to do this with a summarize_all() statement.
The manual attempt works:
test <- data %>%
  dplyr::summarize(count = n(),
                   mean_c1 = mean(column1, na.rm = TRUE),
                   mean_c2 = mean(column2, na.rm = TRUE),
                   median_c1 = percentile(column1, .5),
                   median_c2 = percentile(column2, .5))
The summarize_all() attempt works as long as I do not call percentile for the median. This gets me count, mean, min, and max for my data (vars is a vector of column names):
test <- data %>%
  select(vars) %>%
  dplyr::summarize_all(list(count = ~n(), mean = mean, min = min, max = max))
But I get errors when I try to add the median into the mix: it no longer recognizes percentile, which is a Hive function and not an R/dplyr function ("Error in inherits(x, "fun_list") : object 'percentile' not found"):
test <- data %>%
  select(vars) %>%
  dplyr::summarize_all(list(count = ~n(), mean = mean, min = min, max = max, median = percentile), probs = .5)
I also tried using quantile instead of percentile (which is how I would do this with a regular data frame), but that errors out when I call the 'test' table.
Is it possible to get the median for a Spark table in R using summarize_all()? Or will I have to do it more manually?
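The glue package solved my problem. It builds one percentile() call per column as a string; each string is then parsed and spliced into summarize() with !!!: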
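library(dplyr)
library(rlang)
library(glue)
vars <- tbl_vars(data)
# Build one "percentile(<column>,.5)" string per column and name each result "<column>_median".
# parse_expr() is used in place of parse_quosure(), which is deprecated in newer rlang releases.
eq3 <- glue("percentile({vars},.5)") %>%
  setNames(paste0(vars, "_median")) %>%
  lapply(parse_expr)
# Splice the named expressions into summarize(); sparklyr passes percentile() through to Hive.
test <- data %>%
  dplyr::summarize(!!!eq3)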
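To see what the glue step above actually generates, here is a minimal sketch that runs locally without a Spark connection. The column names column1 and column2 are placeholders borrowed from the question, and median_exprs is just an illustrative name. It uses rlang::parse_expr() for the parsing step and only builds and inspects the expressions; nothing is evaluated, so percentile does not need to exist in R.

```r
library(glue)
library(rlang)

# Placeholder column names; with a real Spark table these would come from tbl_vars(data).
vars <- c("column1", "column2")

# One "percentile(<column>, 0.5)" string per column.
median_strings <- as.character(glue("percentile({vars}, 0.5)"))

# Parse each string into an unevaluated call and name it "<column>_median".
median_exprs <- lapply(median_strings, parse_expr)
names(median_exprs) <- paste0(vars, "_median")

# Inspect the generated calls; these are what !!! splices into summarize().
print(median_exprs)
```

Because the calls are never evaluated in R, the Hive function name only has to make sense once sparklyr translates the summarize() step to SQL. The same pattern should extend to other Hive aggregates by changing the template string, though I have not tested that beyond percentile.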