Tags: r, dplyr, parquet, apache-arrow, tidyeval

How to write a function with tidy eval when using the "arrow" R package (arrow::open_dataset) and dplyr verbs?


What I'm trying to do

I'm trying to write a function that uses dplyr verbs and takes an Arrow dataset (opened with arrow::open_dataset) as its first argument and the name of a column in that dataset as its second. Because I need to pass the column as a string (my actual use case is a Shiny app), I'm using the .data[[.metric]] syntax. Below are an image of the error I'm getting and some code to reproduce it. Any help or insight is appreciated.

Image of error message

[Screenshot of the error message; image not preserved]

Code to reproduce error

# install.packages(c("dplyr", "ggplot2", "arrow"))
library(dplyr)

arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")

dat <- arrow::open_dataset("sample_data.parquet")

glimpse(dat)

get_metric <- function(.data, .metric) {
  .data %>%
    group_by(manufacturer, cyl) %>%
    summarize(
      # .metric is a string; this is the line that triggers the error
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric(dat, "cty") %>% collect()

Additional code that works but makes less use of arrow, so it is not ideal for speed

In this version I call collect() before the tidy eval part, so it's essentially regular dplyr code. It runs, but it is slower than the arrow-based code I had working before I extracted the logic into a function.

get_metric2 <- function(.data, .metric) {
  .data %>%
    # Pull the whole dataset into memory first, then use plain dplyr
    collect() %>%
    group_by(manufacturer, cyl) %>%
    summarize(
      new_col = sum(.data[[.metric]], na.rm = TRUE)
    ) %>%
    ungroup()
}

get_metric2(dat, "cty")

Solution

  • Convert the string to a symbol with rlang::sym() and inject it with the !! operator.

    library(dplyr)

    arrow::write_parquet(x = ggplot2::mpg, sink = "sample_data.parquet")
    dat <- arrow::open_dataset("sample_data.parquet")
    
    get_metric <- function(.data, .metric) {
      # Convert the column name (a string) to a symbol so it can be injected with !!
      .metric <- rlang::sym(.metric)
      .data %>%
        group_by(manufacturer, cyl) %>%
        summarize(
          new_col = sum(!!.metric, na.rm = TRUE)
        ) %>%
        ungroup()
    }
    
    get_metric(dat, "cty") %>%
      collect()
    # # A tibble: 32 × 3
    #    manufacturer   cyl new_col
    #    <chr>        <int>   <int>
    #  1 audi             4     153
    #  2 audi             6     148
    #  3 audi             8      16
    #  4 chevrolet        8     191
    #  5 chevrolet        4      41
    #  6 chevrolet        6      53
    #  7 dodge            4      18
    #  8 dodge            6     225
    #  9 dodge            8     243
    # 10 ford             8     197
    # # ℹ 22 more rows
    # # ℹ Use `print(n = ...)` to see more rows
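
As far as I understand it, this works because rlang::sym() turns the string into a symbol and !! injects that symbol into the expression when it is captured, so the expression handed to arrow should contain a bare column name rather than a .data[[...]] lookup. A minimal sketch using only rlang, where the names col and e are just for illustration:

```r
library(rlang)

# Turn the column name (a string) into a symbol.
col <- sym("cty")

# expr() captures the expression and processes !!, much as dplyr verbs do
# when they capture their arguments.
e <- expr(sum(!!col, na.rm = TRUE))

print(e)
#> sum(cty, na.rm = TRUE)
```

The printed expression refers to cty directly, which is the same thing you would have typed by hand inside summarize().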