Replicating a plyr solution generating quantiles with Hmisc in dplyr


I'm stuck on how to develop a dplyr solution for something that I regularly do in plyr.

Here's the plyr example:

    # load packages (install pacman first if it is missing)
    if (!requireNamespace("pacman", quietly = TRUE)) install.packages("pacman")
    pacman::p_load(dplyr, plyr, Hmisc, tidyverse)

    # generate data
    df_samp <- tibble(
      x_var  = rnorm(100, 0, 1),
      levels = rep(c('a', 'b', 'c', 'd'), 25)
    )

    # working plyr solution that groups data by level and calculates quantiles within levels
    # (plyr::summarise is named explicitly because dplyr also exports summarise)
    plyr_solution <- plyr::ddply(df_samp, ~ levels,
                                 plyr::summarise,
                                 X = wtd.Ecdf(x_var)$x,
                                 Y = wtd.Ecdf(x_var)$ecdf)
    plyr_solution

    # dplyr attempt
    dplyr_solution <- df_samp %>%
      dplyr::select(levels, x_var) %>%
      dplyr::group_by(levels) %>%
      dplyr::mutate(
        X = Hmisc::wtd.Ecdf(x_var)$x,
        Y = Hmisc::wtd.Ecdf(x_var)$ecdf
      )

I'd appreciate any advice on how to debug the current dplyr attempt, or on another approach entirely that uses dplyr.


Solution

  • How about this (it also requires tidyr, though):

    dplyr_solution <- df_samp %>% 
      dplyr::select(levels, x_var) %>%
      dplyr::group_by(levels) %>%
      dplyr::do(X = wtd.Ecdf(.$x_var)$x,
                Y = wtd.Ecdf(.$x_var)$ecdf) %>%
      # tidyr >= 1.0.0 expects the list-columns to be named explicitly
      tidyr::unnest(cols = c(X, Y))
    

    You can't use mutate here: as ?mutate says, mutate "preserves the number of rows of the input", but wtd.Ecdf() can return a different number of values than there are rows in a group, so the number of rows has to change.
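
The length mismatch can be checked directly on a single group. This is a minimal sketch, assuming Hmisc is installed; the seed and the names `x` and `ecdf_fit` are illustrative and not part of the original question:

```r
# One group's worth of data: 25 values, as in each level of df_samp
set.seed(42)                       # arbitrary seed, only for reproducibility
x <- rnorm(25)

# wtd.Ecdf() returns a list with components x and ecdf
ecdf_fit <- Hmisc::wtd.Ecdf(x)

length(x)                          # number of rows mutate() has to fill
length(ecdf_fit$x)                 # number of values wtd.Ecdf() hands back
length(ecdf_fit$ecdf)              # same length as ecdf_fit$x
```

If the last two lengths differ from the first, mutate() has no way to recycle the result into the group, which is why a row-changing verb is needed instead.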

    Edit: Just thought about it a bit more, you don't need tidyr::unnest if you do this:

    dplyr_solution2 <- df_samp %>% 
      dplyr::select(levels, x_var) %>%
      dplyr::group_by(levels) %>%
      dplyr::do( data.frame(X = wtd.Ecdf(.$x_var)$x, 
                 Y = wtd.Ecdf(.$x_var)$ecdf))
    

    Edit no 2: You are right, dplyr::do is mostly deprecated. I was going to suggest a purrr solution, but you had specifically requested dplyr. I always assumed group_map was part of purrr (I guess I discovered them at the same time).

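    You can essentially just swap do for group_modify with a very minor change in syntax (group_map behaved this way in dplyr 0.8.0, but in later versions it returns a list of data frames rather than a single data frame):

    dplyr_solution3 <- df_samp %>% 
      dplyr::select(levels, x_var) %>%
      dplyr::group_by(levels) %>% 
      dplyr::group_modify(~ data.frame(X = wtd.Ecdf(.x$x_var)$x, 
                                       Y = wtd.Ecdf(.x$x_var)$ecdf))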
    

    Or you can swap to purrr::map_dfr:

    purrr_solution <- df_samp %>% 
      dplyr::select(levels, x_var) %>% 
      split(.$levels) %>% 
      purrr::map_dfr(~data.frame(X = wtd.Ecdf(.$x_var)$x, 
                                   Y = wtd.Ecdf(.$x_var)$ecdf), .id = "levels")
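
In dplyr 1.1.0 and later, reframe() is intended for summaries that return an arbitrary number of rows per group, so the same result can probably be obtained without do() or list-columns. This is a sketch rather than a tested drop-in replacement, assuming dplyr >= 1.1.0 and Hmisc are installed; `reframe_solution` and `ecdf_fit` are illustrative names, and the data are regenerated so that the block runs on its own:

```r
library(dplyr)

# Same shape of data as in the question (seed added only for reproducibility)
set.seed(1)
df_samp <- tibble(
  x_var  = rnorm(100, 0, 1),
  levels = rep(c("a", "b", "c", "d"), 25)
)

reframe_solution <- df_samp %>%
  group_by(levels) %>%
  reframe({
    # compute the ECDF once per group instead of once per output column
    ecdf_fit <- Hmisc::wtd.Ecdf(x_var)
    tibble(X = ecdf_fit$x,
           Y = as.numeric(ecdf_fit$ecdf))  # drop the names carried by ecdf
  })

reframe_solution
```

Unlike the do() and group_modify() versions, reframe() always returns an ungrouped data frame, so add a group_by(levels) afterwards if later steps rely on the grouping.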