r dplyr tidyverse across

Tidyverse: Why does selection helper piped to across() throw note/warning/error about external vector when placing helper inside across does not?


Piping (the value of) a selection helper (matches(), contains(), starts_with(), ends_with()) to the across() function behaves differently from putting the selection helper inside the parentheses of across().

  • Why does this happen?
  • Is this the expected behavior or a bug?

Reproduce

library(dplyr)

# Very simple function: returns input
self = function(x){x}

# Data to manipulate
dtemp = tibble(var = 1:2)

# No note/warning/error when selection helper is inside across()
dtemp %>% mutate(across(matches("var"), self))

# Note/warning/error when selection helper is piped to across()
dtemp %>% mutate(matches("var") %>% across(self))

Observed behavior

The last line causes R to print

Note: Using an external vector in selections is ambiguous.
i Use `all_of(.)` instead of `.` to silence this message.
i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.

Note that this message is only printed once per session, so you must restart R to see it again (unless there is some other way to reset the counter that controls this printing).

The penultimate command (with matches() inside across()) does not cause R to print the note.

Expected behavior

The last two commands should behave identically.

Additional info

  • dplyr version: 1.0.6
  • tidyverse version: 1.3.1
> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.6

loaded via a namespace (and not attached):
 [1] fansi_0.5.0      assertthat_0.2.1 utf8_1.2.1       crayon_1.4.1    
 [5] R6_2.5.0         DBI_1.1.1        lifecycle_1.0.0  magrittr_2.0.1  
 [9] pillar_1.6.1     cli_2.5.0        rlang_0.4.11     rstudioapi_0.13 
[13] vctrs_0.3.8      generics_0.1.0   ellipsis_0.3.2   tools_4.0.3     
[17] glue_1.4.2       purrr_0.3.4      compiler_4.0.3   pkgconfig_2.0.3 
[21] tidyselect_1.1.1 tibble_3.1.2    

Solution

  • Evaluating a selection helper such as matches() on its own just returns a plain vector of column positions. tidyselect introduced all_of() (re-exported by dplyr) to remove ambiguity about such external vectors. Suppose you have a data.frame with two columns x and y, and a variable x = 'y'. If you now call select(x), do you mean the column x or the column y? all_of(x) removes this ambiguity.

    So when you pipe matches() to across(), it looks like you are selecting with an external vector named `.`. The reason is that the pipe hands the helper's result to .cols through the placeholder `.`, so across() captures the symbol `.` and not the helper call. This is the basic principle of how R pipes pass their input. You can test this with rlang::enquo(). If f <- function(x) rlang::enquo(x), then running f(1) is different from running 1 %>% f(). The first returns a quoted 1; the second returns a quoted `.`. So evaluating across(<selection-helper>) corresponds to f(1), while <selection-helper> %>% across() corresponds to 1 %>% f(). For across(), the latter looks like a variable `.` with a dedicated environment containing its value (hence it looks like an external vector called `.`).

    To clarify, let's look at the outputs:

    library(magrittr)
    f <- function(x) rlang::enquo(x)
    f(1)
    #> <quosure>
    #> expr: ^1
    #> env:  empty
    1 %>% 
        f()
    #> <quosure>
    #> expr: ^.
    #> env:  00000000166DD798
    `%>%`(1,f)
    #> <quosure>
    #> expr: ^.
    #> env:  00000000165176F0
    

    Created on 2021-06-24 by the reprex package (v2.0.0)

    The first call captures and quotes 1, which has no environment because it is a constant, not an object or symbol. With the pipe, the value of the lhs is stored in an environment under the name `.`, so the value 1 now lives in that environment. This is the essence of how the pipe works: store the lhs value(s) under the name `.` in an environment, pass `.` as the first argument of the rhs (unless `.` is placed somewhere else), and evaluate the rhs.
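
The mechanism described above can be imitated in base R without magrittr or rlang. This is only a rough sketch, not how %>% is actually implemented: capture() and simple_pipe() are hypothetical helpers invented for illustration, and substitute() stands in for rlang::enquo().

```r
# capture() returns the unevaluated expression supplied as its argument
# (a base R stand-in for rlang::enquo(); hypothetical helper).
capture <- function(x) substitute(x)

# simple_pipe() is a hypothetical, much simplified stand-in for %>%:
# bind the left-hand side to the name `.`, then call the function on `.`.
simple_pipe <- function(lhs, fun) {
  . <- lhs
  fun(.)
}

direct <- capture(1)               # the function sees the constant 1
piped  <- simple_pipe(1, capture)  # the function sees the symbol `.`

print(direct)
print(piped)
```

In the direct call the captured expression is the constant itself, while in the piped call it is the symbol `.`, which is the same difference across() sees.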

    Hence the note appears because it looks like you are giving a symbol as input, not a value. If you keep the <selection-helper> inside across(), the captured expression is the helper call itself and not a symbol standing for an external vector, so there is no ambiguity. The principle is the same as f(1). I hope this gives some clarity. Note that as soon as the input is evaluated it is no longer `.` but its real value; you can see this by adding print(x) before quoting x. You can read more about this in Advanced R (https://adv-r.hadley.nz/): mainly chapter 7 for environments, and the Metaprogramming section for quoting, promises, and evaluation.
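
The column-versus-variable ambiguity that all_of() resolves can be reproduced directly. This is a small sketch assuming dplyr 1.0.0 or later; the data frame df and the variable x are made up for illustration and do not come from the question.

```r
library(dplyr)

# Illustrative data: two columns, x and y (assumed, not from the question)
df <- tibble(x = 1:2, y = 3:4)

# An external variable that shares its name with a column
x <- "y"

# A bare name that matches a column selects that column
by_name <- names(select(df, x))

# all_of() states explicitly that the external character vector is meant
by_all_of <- names(select(df, all_of(x)))

print(by_name)
print(by_all_of)
```

For the code in the question, the simplest way to avoid the note is to keep the helper inside across(), as in mutate(across(matches("var"), self)).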