r dplyr median sparklyr

Using summarize_all() to find median on sparklyr data


I am trying to get summary statistics, including mean and median, for data in R that I pulled using sparklyr. I can get all my stats by typing everything out manually in a dplyr::summarize() step, but I would like to know whether there is a way to do this with a summarize_all() statement.

Manual attempt works:

test<-data%>%
    dplyr::summarize(count=n(),
                     mean_c1=mean(column1,na.rm=TRUE),
                     mean_c2=mean(column2,na.rm=TRUE),
                     median_c1=percentile(column1,.5),
                     median_c2=percentile(column2,.5))

The summarize_all() attempt works as long as I do not call percentile for the median. This gets me count, mean, min, and max for my data (vars is a vector of column names):

test<-data%>%
    select(vars)%>%
    dplyr::summarize_all(list(count=~n(),mean=mean, min=min,max=max))

But I get an error when I try to add the median to the mix: it no longer recognizes the percentile command, which is a Hive function and not an R/dplyr function ("Error in inherits(x, "fun_list") : object 'percentile' not found").

test<-data%>%
    select(vars)%>%
    dplyr::summarize_all(list(count=~n(),mean=mean, min=min,max=max,median=percentile),probs=.5)

I tried using quantile instead of percentile (which is how I would do this with a data frame), but it errors out when I call the 'test' table.

Is it possible to get the median for a Spark table in R using the summarize_all() command? Or will I have to do it more manually?


Solution

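  • The glue package solved my problem: build one percentile() call per column as a string, parse each string into an expression, and splice the resulting list into summarize().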

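    library(dplyr)
    library(rlang)
    library(glue)

    # Column names of the Spark table
    vars<-tbl_vars(data)
    # One "percentile(<column>,.5)" call per column, named "<column>_median"
    eq3<-glue("percentile({vars},.5)")%>%
        setNames(paste0(vars,"_median"))%>%
        # parse_expr() is used here in place of the deprecated parse_quosure()
        lapply(parse_expr)
    # Splice the named expressions into a single summarize() call
    test<-data%>%
        dplyr::summarize(!!!eq3)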
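
The same string-to-expression technique can be tried without a Spark connection. The sketch below is only an illustration under stated assumptions: local_data is a hypothetical in-memory table standing in for the Spark table, and base R's median() stands in for Hive's percentile(), which only exists on the Spark side.

```r
library(dplyr)
library(glue)
library(rlang)

# Hypothetical local stand-in for the Spark table (not from the original post)
local_data <- tibble(
    column1 = c(1, 2, 3, 4, 100),
    column2 = c(10, 20, 30, NA, 50)
)

vars <- names(local_data)

# Build one call per column as text, e.g. "median(column1, na.rm = TRUE)".
# Against a Spark table the text would be "percentile(column1, 0.5)" instead.
median_exprs <- glue("median({vars}, na.rm = TRUE)") %>%
    as.character() %>%
    setNames(paste0(vars, "_median")) %>%
    lapply(parse_expr)

# !!! splices the named list, so each element becomes its own summary column
result <- local_data %>%
    summarize(!!!median_exprs)

print(result)
```

Because the calls are built as text and only parsed into unevaluated expressions, R never looks up percentile as an R object, which is what triggered the "object 'percentile' not found" error in the summarize_all() attempt. A formula lambda such as ~percentile(., 0.5) inside summarize_all() may achieve the same effect, since the call inside the formula is likewise left unevaluated for translation to SQL, but that variant is untested here.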