Tags: r, dplyr, nse

NSE on complex expressions with dplyr's do()


Can someone help me understand how NSE works with dplyr when the variable reference is in the form ".$mpg"?

After reading here, I thought using as.name would do it, since I have a character string that gives a variable name.

For example, this works:

library(dplyr)
library(lazyeval)
mtcars %>% 
summarise_(interp(~mean(var), var = as.name("mpg")))

and this doesn't work:

mtcars %>% 
summarise_(interp(~mean(var), var = as.name(".$mpg")))

but this does:

mtcars %>% 
 summarise(mean(.$mpg)) 

and so does this:

mtcars %>%
summarise(mean(mpg)) 

I want to be able to specify the variable in the form .$mpg so that I can use it with do() when I don't have the option of specifying a dot for the data like in the following example:

library(dplyr)
library(broom)

mtcars %>% 
 tbl_df() %>% 
 slice(., 1) %>% 
 do(tidy(prop.test(.$mpg, .$disp, p = .50)))
  • I chose arbitrary variables here only to demonstrate how the prop.test function is called; please don't interpret this as a misuse of the test.

Eventually, I want to turn this into a function like this:

 library(lazyeval)
 library(broom)
 library(dplyr)


p_test <- function(x, miles, distance){
        x %>% 
         tbl_df() %>% 
         slice(., 1) %>% 
         do_(tidy(prop.test(miles, distance, p = .50))) 
  }

p_test(mtcars, ".$mpg", ".$disp")

I originally thought that I would have to do something like interp(~var, var = as.name(miles)), where miles would be replaced with .$mpg, but as I mentioned at the top, this does not seem to work.


Solution

  • The reason is that as.name creates an unevaluated variable name, but .$mpg, when used in code, is not a variable name. Rather, it’s a complex expression which is equivalent to:

    `$`(., mpg)
    

    That is, it’s a function call to the function $ with two arguments. Using as.name causes R to subsequently search for a variable with the name `.$mpg` rather than calling the above-described function.
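The difference between a name and a call can be checked directly in base R. The following is a minimal sketch, not part of the original answer; the variable names (sym, cl, env) are made up for illustration, and mtcars stands in for the data that dplyr would bind to the dot.

```r
# A symbol and a call to `$` are different kinds of language objects.
sym <- as.name(".$mpg")   # one symbol whose name merely contains a "$"
cl  <- quote(.$mpg)       # a call to `$` with the arguments . and mpg

is.name(sym)
is.call(cl)
identical(cl, quote(`$`(., mpg)))   # both spellings produce the same call

# Evaluate both with `.` bound to a data frame (an illustrative stand-in
# for what dplyr does inside do()).
env <- list(. = mtcars)
sym_result  <- tryCatch(eval(sym, env), error = function(e) "error")
call_result <- eval(cl, env)
```

Evaluating the symbol fails because no object named `.$mpg` exists, whereas evaluating the call extracts the mpg column.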

    That’s the explanation of why your attempt doesn’t work. The solution is then relatively straightforward: instead of creating an unevaluated variable name, we need to create an unevaluated function call expression. We can do this in various ways, and I’m going to show two here.

    The first is simply to call parse:

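    p_test = function (data, miles, distance) {
        # Parse each string (e.g. '.$mpg') into an unevaluated call.
        x = parse(text = miles)[[1]]
        n = parse(text = distance)[[1]]
        data %>%
            slice(1) %>%
            # Splice both calls into the formula before do_() evaluates it.
            do_(interp(~tidy(prop.test(x, n, p = 0.5)), x = x, n = n))
    }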
    

    Now you can call p_test(mtcars, '.$mpg', '.$disp') and get the desired result.
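Since parse accepts any code string, it may be preferable to build the call from a plain column name instead. The sketch below uses only base R and assumes the caller passes "mpg" rather than ".$mpg"; the helper name dot_col is hypothetical and does not come from the original answer.

```r
# Hypothetical helper: build the call `.$<column>` from a column name.
dot_col <- function(column) {
  call("$", quote(.), as.name(column))
}

parsed <- parse(text = ".$mpg")[[1]]   # the parse-based approach
built  <- dot_col("mpg")               # the same call, built without parsing text

same_call <- identical(parsed, built)

# Either expression can be evaluated once `.` is bound to a data frame.
first_row <- mtcars[1, ]
value <- eval(built, list(. = first_row))
```

Both routes yield the same unevaluated expression, so either could be handed to interp in place of x or n.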

    However, a more dplyr-y way of doing the same thing would be to pass unevaluated objects to p_test:

    p_test(mtcars, mpg, disp)
    

    … and we can easily do this with a simple change:

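    # Standard-evaluation version: var1 and var2 are column names (strings or symbols).
    p_test_ = function (data, var1, var2) {
        data %>%
            slice(1) %>%
            do_(interp(~tidy(prop.test(.$x, .$n, p = 0.5)),
                       x = as.name(var1), n = as.name(var2)))
    }
    
    # NSE wrapper: capture the bare column names and forward them.
    p_test = function (data, var1, var2) {
        p_test_(data, substitute(var1), substitute(var2))
    }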
    

    Now the following two pieces of code both work:

    p_test(mtcars, mpg, disp)
    p_test_(mtcars, 'mpg', 'disp')
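The two ingredients of this pattern can be inspected in base R without dplyr. This is a rough sketch: the function name capture is invented for illustration, and base substitute() is used here only to mimic the kind of replacement that interp appears to perform on the formula.

```r
# substitute() inside a function returns the argument as typed, unevaluated.
capture <- function(var) substitute(var)
captured <- capture(mpg)   # the symbol mpg; no object called mpg is needed

# as.name() accepts both strings and symbols, which is why p_test_ can be
# called with 'mpg' or with an already captured symbol.
same_name <- identical(as.name(captured), as.name("mpg"))

# Replacing the placeholder x inside `.$x` yields the call `.$mpg`.
filled <- substitute(.$x, list(x = as.name("mpg")))
same_call <- identical(filled, quote(.$mpg))
```

Keeping the string-based p_test_ alongside the NSE wrapper means the function can still be used programmatically, for example when looping over a vector of column names.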