r dplyr lazy-evaluation

Pass arguments to dplyr functions


I want to parameterise the following dplyr computation, which finds the values of Sepal.Length that are associated with more than one value of Sepal.Width:

library(dplyr)

iris %>%
    group_by(Sepal.Length) %>%
    summarise(n.uniq=n_distinct(Sepal.Width)) %>%
    filter(n.uniq > 1)

Normally I would write something like this:

not.uniq.per.group <- function(data, group.var, uniq.var) {
    data %>%
        group_by(group.var) %>%
        summarise(n.uniq=n_distinct(uniq.var)) %>%
        filter(n.uniq > 1)
}

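However, this approach throws errors because dplyr uses non-standard evaluation: group_by(group.var) looks for a column literally named group.var rather than the column the argument refers to. How should this function be written?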


Solution

  • Use the standard evaluation versions of the dplyr functions (append '_' to the function name, i.e. group_by_ and summarise_) and pass the column names to your function as strings, which are then turned into symbols. To parameterise the argument of summarise_, use interp() from the lazyeval package, which substitutes the symbol into a formula. Concretely:

    library(dplyr)
    library(lazyeval)
    
    not.uniq.per.group <- function(df, grp.var, uniq.var) {
        df %>%
            group_by_(grp.var) %>%
            summarise_( n_uniq=interp(~n_distinct(v), v=as.name(uniq.var)) ) %>%
            filter(n_uniq > 1)
    }
    
    not.uniq.per.group(iris, "Sepal.Length", "Sepal.Width")
    

    Note that since dplyr 0.7.0 the standard evaluation versions of the dplyr functions (group_by_, summarise_ and so on) have been deprecated in favor of tidy evaluation, dplyr's framework for programming with non-standard evaluation.

    See the Programming with dplyr vignette for more information on working with non-standard evaluation.
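
For the newer dplyr versions mentioned in the note, the same function could probably be written without lazyeval. The following is a minimal sketch rather than a definitive implementation, assuming dplyr 1.0.0 or later. The function names not_uniq_per_group and not_uniq_per_group_chr are made up for this illustration. The first variant takes bare column names and forwards them with the embrace operator {{ }}; the second takes column names as strings and looks them up through the .data pronoun.

```r
library(dplyr)

# Hypothetical helper: takes bare (unquoted) column names and forwards
# them to the dplyr verbs with the embrace operator {{ }}.
not_uniq_per_group <- function(data, group_var, uniq_var) {
    data %>%
        group_by({{ group_var }}) %>%
        summarise(n_uniq = n_distinct({{ uniq_var }})) %>%
        filter(n_uniq > 1)
}

# Hypothetical variant: takes column names as strings and looks them up
# in the data frame via the .data pronoun.
not_uniq_per_group_chr <- function(data, group_var, uniq_var) {
    data %>%
        group_by(.data[[group_var]]) %>%
        summarise(n_uniq = n_distinct(.data[[uniq_var]])) %>%
        filter(n_uniq > 1)
}

res     <- not_uniq_per_group(iris, Sepal.Length, Sepal.Width)
res_chr <- not_uniq_per_group_chr(iris, "Sepal.Length", "Sepal.Width")
```

The bare-name variant reads like ordinary dplyr code at the call site, while the string variant is easier to drive from a vector of column names, for example inside a loop.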