I'm trying to do some parametrised dplyr manipulations. The simplest reproducible example to express the root of the problem is this:
# Packages used throughout
library(dplyr)
library(lazyeval)
# Data
test <- data.frame(group = rep(1:5, each = 2),
                   value = as.integer(c(NA, NA, 2, 3, 3, 5, 7, 8, 9, 0)))
> test
group value
1 1 NA
2 1 NA
3 2 2
4 2 3
5 3 3
6 3 5
7 4 7
8 4 8
9 5 9
10 5 0
# Summarisation example, this is what I'd like to parametrise
# so that I can pass in functions and grouping variables dynamically
test.summary <- test %>%
  group_by(group) %>%
  summarise(group.mean = mean(value, na.rm = TRUE))
> test.summary
Source: local data frame [5 x 2]
group group.mean
<int> <dbl>
1 1 NaN
2 2 2.5
3 3 4.0 # Correct results
4 4 7.5
5 5 4.5
This is how far I got alone
# This works fine, but notice there's no 'na.rm = TRUE' passed in
doSummary <- function(d_in = data, func = 'mean', by = 'group') {
  # d_in: data in
  # func: required function for summarising
  # by: the variable to group by
  # NOTE: the summary is always for the 'value' column in any given dataframe
  # Operations for summarise_
  ops <- interp(~f(value),
                .values = list(f = as.name(func),
                               value = as.name('value')))
  d_out <- d_in %>%
    group_by_(by) %>%
    summarise_(.dots = setNames(ops, func))
}
> doSummary(test)
Source: local data frame [5 x 2]
group mean(value)
<int> <dbl>
1 1 NA
2 2 2.5
3 3 4.0
4 4 7.5
5 5 4.5
Trying with the 'na.rm' parameter
# When I try passing in the 'na.rm = T' parameter it breaks
doSummary.na <- function(d_in = data, func = 'mean', by = 'group') {
  # Doesn't work
  ops <- interp(~do.call(f, args),
                .values = list(f = func,
                               args = list(as.name('value'), na.rm = TRUE)))
  d_out <- d_in %>%
    group_by_(by) %>%
    summarise_(.dots = setNames(ops, func))
}
> doSummary.na(test)
Error: object 'value' not found
Many thanks for your help!
Your title mentions ... but your question doesn’t. If we don’t need to deal with ..., the answer gets a lot easier: we don’t need do.call at all and can call the function directly. Simply replace your ops definition with:
ops = interp(~f(value, na.rm = TRUE),
f = match.fun(func), value = as.name('value'))
Note that I’ve used match.fun here instead of as.name. This is generally a better idea since it works “just like R” for function lookup. As a consequence, you can pass not only a function name as a character string but also a bare function name or an anonymous function:
doSummary.na(test, function (x, ...) mean(x, ...) / sd(x, ...)) # x̄/s?! Whatever.
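As a rough illustration of that lookup behaviour, here is a small base-R sketch. The variable names (f_from_string, f_from_object, f_anonymous, x) are made up for the example and appear nowhere in your code:

```r
# match.fun() returns a function object whether it is given a
# character string, a function referred to by name, or an
# anonymous function.
f_from_string <- match.fun("mean")
f_from_object <- match.fun(mean)
f_anonymous   <- match.fun(function(x, ...) mean(x, ...) / sd(x, ...))

x <- c(2, 4, 6, NA)

f_from_string(x, na.rm = TRUE)  # 4
f_from_object(x, na.rm = TRUE)  # 4
f_anonymous(x, na.rm = TRUE)    # 2, i.e. mean 4 divided by sd 2
```

All three results are ordinary functions, so the rest of the code no longer has to care how the caller specified func.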
Speaking of which, your attempt to set the column names also fails; you need to put ops into a list to fix that:
d_in %>%
  group_by_(by) %>%
  summarise_(.dots = setNames(list(ops), func))
… because .dots expects a list of operations (and setNames also expects a vector/list). However, this code once again won’t work if the func argument you pass to the function isn’t a character vector. To make this more robust, use something like this:
fname = if (is.character(func)) {
    func
} else if (is.name(substitute(func))) {
    as.character(substitute(func))
} else {
    'func'
}
d_in %>%
    group_by_(by) %>%
    summarise_(.dots = setNames(list(ops), fname))
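To see what that naming logic produces for each kind of argument, here is a minimal sketch that wraps it in a standalone helper. The name label_for is hypothetical and exists only for this illustration:

```r
# Hypothetical helper: derive a column label from however the
# function was supplied.
label_for <- function(func) {
  if (is.character(func)) {
    func                             # given as a string: use it as-is
  } else if (is.name(substitute(func))) {
    as.character(substitute(func))   # given as a bare name: use that name
  } else {
    'func'                           # anything else: generic fallback
  }
}

label_for('mean')                 # "mean"
label_for(mean)                   # "mean"
label_for(function(x) max(x))     # "func"
```

One thing to keep in mind: substitute looks at the expression the caller wrote, so if the function is passed through a variable (say f <- mean; label_for(f)), the label will be "f" rather than "mean".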
Things get more complicated if you actually want to allow passing ... instead of known arguments, because (as far as I know) there’s simply no direct way of passing ... via interp and, like you, I cannot get the do.call approach to work.
The ‹lazyeval› package provides the very nice function make_call, which helps us on the way to a solution. The above could also be written as
# Not good. :-(
ops = make_call(as.name(func), list(as.name('value'), na.rm = TRUE))
This works, but only when func is passed as a character vector. As explained above, this simply isn’t flexible.
However, make_call simply wraps base R’s as.call, and we can use that directly:
ops = as.call(list(match.fun(func), as.name('value'), na.rm = TRUE))
And now we can simply pass ... on:
doSummary = function (d_in = data, func = 'mean', by = 'group', ...) {
    ops = as.call(list(match.fun(func), as.name('value'), ...))
    fname = if (is.character(func)) {
        func
    } else if (is.name(substitute(func))) {
        as.character(substitute(func))
    } else {
        'func'
    }
    d_in %>%
        group_by_(by) %>%
        summarize_(.dots = setNames(list(ops), fname))
}
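If you want to check the call-building step in isolation, without dplyr in the picture, something like the following base-R sketch may help. build_op is a hypothetical helper name, and eval() on the data frame is only a stand-in for what summarize_ does per group:

```r
test <- data.frame(group = rep(1:5, each = 2),
                   value = as.integer(c(NA, NA, 2, 3, 3, 5, 7, 8, 9, 0)))

# Hypothetical helper: build an unevaluated call f(value, ...) in which
# every extra argument in ... becomes an argument of the call.
build_op <- function(func, ...) {
  as.call(list(match.fun(func), as.name('value'), ...))
}

op_plain <- build_op('mean')                # no extra arguments
op_narm  <- build_op('mean', na.rm = TRUE)  # na.rm forwarded via ...

# Evaluate the calls with the data frame's columns in scope.
eval(op_plain, test)                     # NA, because value contains NAs
eval(op_narm, test)                      # 4.625
eval(op_narm, test[test$group == 2, ])   # 2.5, matching group 2 above
```

The symbol value stays unevaluated inside the call until it is evaluated against some data, which is the property the summarising step relies on.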
To be clear: the same could be achieved using interp, but I think this would require manually building a formula object from a list, which amounts to doing very much the same as in my solution, and then (redundantly) calling interp on the result.
I generally find that while ‹lazyeval› is incredibly elegant, in some situations base R provides simpler solutions. In particular, interp is a powerful replacement for substitute, but bquote, a quite underused base R function, already provides many of the same syntactic benefits. The great benefit of ‹lazyeval› objects is that they carry around their evaluation environment, unlike base R expressions. However, this is simply not always needed.
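For comparison, here is a minimal sketch of what the bquote route could look like for the fixed na.rm = TRUE case. The variable names f, op and small are assumptions made up for this example, and it only covers a function given by name:

```r
# bquote() quotes its argument but evaluates anything wrapped in .()
# and splices the result into the quoted expression.
f  <- as.name('mean')
op <- bquote(.(f)(value, na.rm = TRUE))

op                  # the unevaluated call mean(value, na.rm = TRUE)

small <- data.frame(value = c(2, 3, NA))
eval(op, small)     # 2.5
```

Unlike a ‹lazyeval› formula, the resulting call carries no environment with it, so the lookup of mean happens wherever the call is eventually evaluated.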