
Very basic question about the native pipe and tidyverse


I can't figure out what works and what doesn't with the native pipe. Here are two examples that I expected to work but that fail. I guess my problem is that I thought it would work like the magrittr pipe.

Is there some variation of option 1 or 2 that achieves the same result as 3?

library(tidyverse)

# 1. What I think will work, but does not:
mtcars |> 
  n_distinct(gear)

# 2. What I think will work, but does not:
mtcars |> 
  n_distinct(_$gear)

# 3. Does work
mtcars |> 
  pull(gear) |> 
  n_distinct()

EDIT: The answer is that it most likely depends on whether the function expects a vector or a data frame.

Both answers do a good job of answering this question, but @r2evans's answer is probably more helpful for others, so I will mark that as the solution.


Solution

  • The difference is whether the function is intended to operate on a data.frame (or something frame-like, such as a tbl_df) or on a vector. mutate, summarize, and similar verbs all work on frames, whereas some of dplyr's functions are meant to operate on vectors, whether inside a call to mutate(...) or not. These non-frame functions must be given a vector.

    TLDR: with is a cheater function that supports the non-standard evaluation you're looking for: mtcars |> with(n_distinct(gear)) works (among many other expressions).
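
    To see why the first attempt fails, it helps to look at what the native pipe turns the code into. This is a minimal sketch, assuming R >= 4.1 with dplyr attached and no object called gear in the session; the names rewritten, attempt1, via_with, and via_lambda are only illustrative:

    ```r
    library(dplyr)

    # The native pipe is rewritten by the parser: the left-hand side simply
    # becomes the first argument of the call on the right.
    rewritten <- quote(mtcars |> n_distinct(gear))
    print(rewritten)
    # n_distinct(mtcars, gear)

    # So `gear` is evaluated as an ordinary variable, not as a column of
    # mtcars, and the call fails. The error message is captured as a string.
    attempt1 <- tryCatch(
      n_distinct(mtcars, gear),
      error = function(e) conditionMessage(e)
    )

    # with() evaluates the expression with the columns of mtcars in scope.
    via_with <- mtcars |> with(n_distinct(gear))

    # An anonymous function is another way to hand over just the column.
    via_lambda <- mtcars |> (\(d) n_distinct(d$gear))()
    ```

    Both via_with and via_lambda should equal 3, matching the pull() version in the question.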

    You can tell which functions can be used the way you tried here (unwrapped, so to speak) by checking their arguments: if the first argument is something like .data=, data=, or x= (and is documented as expecting a data.frame-like object), then the function can be used immediately after |> or %>%. For instance, mutate, summarize, and reframe all have something like this in their help pages:

    Arguments:
    
       .data: A data frame, data frame extension (e.g. a tibble), or a lazy
              data frame (e.g. from dbplyr or dtplyr). See _Methods_,
              below, for more details.
    

    Even tidyr functions (that work on the top-level like that) are similar, with

    Usage:
    
         pivot_wider(
           data,
           ...,
           id_cols = NULL,
           <truncated>
    

    Whereas n_distinct documents its arguments as:

    Usage:
    
         n_distinct(..., na.rm = FALSE)
         
    Arguments:
    
         ...: Unnamed vectors. If multiple vectors are supplied, then they
              should have the same length.
    

    where its first (and optionally more) argument is a vector.
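
    The same check can be done from the console instead of reading the help pages. This is a rough sketch, assuming dplyr and tidyr are installed; first_arg is a hypothetical helper written for this illustration, not part of either package:

    ```r
    library(dplyr)
    library(tidyr)

    # Hypothetical helper: return the name of a function's first argument.
    first_arg <- function(f) names(formals(f))[1]

    first_arg(mutate)       # ".data"
    first_arg(summarize)    # ".data"
    first_arg(pivot_wider)  # "data"
    first_arg(n_distinct)   # "..."
    ```

    The argument name is only a hint. The help page remains the authority on whether that argument expects a data frame or a vector.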

    I infer that the result you want from n_distinct is a single integer, so we can easily adapt your first attempt to get it:

    n_distinct(mtcars$gear)
    # [1] 3
    mtcars |> with(n_distinct(gear))
    # [1] 3
    mtcars |>
      summarize(ngears = n_distinct(gear)) |>
      pull(ngears)
    # [1] 3
    

    You asked about dplyr-specific verbs and the pipe, but the same notion applies in the very similar package data.table: its counting function (uniqueN) does not find columns on its own either. It needs to operate either on a vector or within data.table's [-scope (which is analogous in effect to needing to be within dplyr's verbs):

    library(data.table)  # needed for as.data.table() and the unqualified uniqueN() below
    data.table::uniqueN(mtcars$gear)
    as.data.table(mtcars)[, uniqueN(gear)]
    # blend of dplyr/data.table
    as.data.table(mtcars)[, n_distinct(gear)]
    

    The main reason for this is that dplyr and data.table both allow non-standard evaluation (NSE) of column names. NSE is supported in a few base R functions (with, within, subset, and transform come to mind; there are others), but it is prevalent in dplyr and data.table. NSE is how you are able to do something like

    mtcars |>
      summarize(ngears = n_distinct(gear))
    

    without having to write mtcars$gear instead of gear. (For a few reasons, mtcars$gear inside of mutate/summarize/... is actively discouraged in dplyr anyway.)
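
    The base R functions mentioned above show the same behaviour without any packages. This is a small sketch using only base R and the built-in mtcars data; the object names (n_gears, four_gears, with_ratio) and the mpg_per_wt column are made up for illustration:

    ```r
    # with(): evaluate an expression using the columns as variables.
    n_gears <- with(mtcars, length(unique(gear)))
    n_gears
    # [1] 3

    # subset(): the filter condition refers to the column by its bare name.
    four_gears <- subset(mtcars, gear == 4)

    # transform(): a new column is computed from existing ones by bare name.
    with_ratio <- transform(mtcars, mpg_per_wt = mpg / wt)

    # The standard-evaluation equivalent has to spell out the data frame.
    n_gears_se <- length(unique(mtcars$gear))
    ```

    In each NSE call the bare column name is looked up inside the data frame first, which is the same mechanism that lets summarize() find gear.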