I'm trying to develop a deeper understanding of using the dot (".") with dplyr and using the .data pronoun with dplyr. The code I was writing that motivated this post looked something like this:
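library(dplyr)

# Empty tibble that accumulates the per-variable counts
cat_table <- tibble(
  variable = vector("character"),
  category = vector("numeric"),
  n        = vector("numeric")
)
for (i in c("cyl", "vs", "am")) {
  cat_stats <- mtcars %>%
    count(.data[[i]]) %>%                # count rows per level of column i
    mutate(variable = names(.)[1]) %>%   # record the name of the counted column
    rename(category = 1)                 # give the first column a common name
  cat_table <- bind_rows(cat_table, cat_stats)
}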
# A tibble: 7 x 3
  variable category     n
  <chr>       <dbl> <dbl>
1 cyl             4    11
2 cyl             6     7
3 cyl             8    14
4 vs              0    18
5 vs              1    14
6 am              0    19
7 am              1    13
The code does what I wanted it to do and isn’t really the focus of this question. I was just providing it for context.
I'm trying to develop a deeper understanding of why it does what I want it to do and, more specifically, why I can't use . and .data interchangeably. I've read the Programming with dplyr article, but in my mind both . and .data just mean "our result up to this point in the pipeline." It appears that this mental model oversimplifies how they work, because I get an error when I use .data inside of names() below:
mtcars %>%
  count(.data[["cyl"]]) %>%
  mutate(variable = names(.data)[1])
Error: Problem with `mutate()` input `variable`.
x Can't take the `names()` of the `.data` pronoun
ℹ Input `variable` is `names(.data)[1]`.
Run `rlang::last_error()` to see where the error occurred.
And I get an unexpected (to me) result when I use . inside of count():
mtcars %>%
  count(.[["cyl"]]) %>%
  mutate(variable = names(.)[1])
  .[["cyl"]]  n   variable
1          4 11 .[["cyl"]]
2          6  7 .[["cyl"]]
3          8 14 .[["cyl"]]
I suspect it has something to do with this passage from the Programming with dplyr article: "Note that .data is not a data frame; it’s a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don’t expect other functions to work with it." This tells me what .data isn't (a data frame), but I'm still not sure what .data is and how it differs from the dot.
I tried figuring it out like this:
mtcars %>%
  count(.data[["cyl"]]) %>%
  mutate(variable = list(.data))
But the result, <S3: rlang_data_pronoun>, doesn't tell me anything that helps me understand. If anybody out there has a better grasp on this, I would appreciate a brief lesson. Thanks!
Up front, I think .data's intent is a little confusing until one also considers its sibling pronoun, .env.
The dot . is something that magrittr::%>% sets up and uses; since dplyr re-exports the pipe, the dot is available in dplyr pipelines too. Whenever you reference it, it is a real object, so names(.), nrow(.), etc. all work as expected. It does reflect the data up to this point in the pipeline.
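As a small sketch of that point (assuming dplyr is installed; the variable names are only illustrative), the dot can be handed to ordinary base functions because the pipe simply binds it to whatever is on its left-hand side:

```r
library(dplyr)  # re-exports %>% from magrittr

# Inside braces, the dot is the data frame that was piped in,
# so ordinary base functions work on it.
first_names <- mtcars %>% { names(.)[1:3] }
row_count   <- mtcars %>% { nrow(.) }

# After count(), the dot is the new, summarised data frame, not mtcars.
counted_names <- mtcars %>% count(cyl) %>% { names(.) }

print(first_names)
print(row_count)
print(counted_names)
```

This is also why names(.)[1] in the question picks up the name of the column produced by count(): by that step the dot is the output of count().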
.data, on the other hand, is defined within rlang for the purpose of disambiguating symbol resolution. Along with .env, it allows you to be perfectly clear on where you want a particular symbol resolved (when ambiguity is expected). From ?.data, I think this is a clarifying contrast:
disp <- 10
mtcars %>% mutate(disp = .data$disp * .env$disp)
mtcars %>% mutate(disp = disp * disp)
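To see that these two calls really do resolve disp differently, here is a minimal check (a sketch, assuming dplyr is attached; explicit and implicit are names chosen only for this illustration) that compares each result against the same arithmetic done in base R:

```r
library(dplyr)

disp <- 10  # a variable that shares its name with a column of mtcars

# With pronouns: column value times the outside variable
explicit <- mtcars %>% mutate(disp = .data$disp * .env$disp)

# Without pronouns: both symbols resolve to the column first
implicit <- mtcars %>% mutate(disp = disp * disp)

# Each comparison should come back TRUE
all.equal(explicit$disp, mtcars$disp * 10)
all.equal(implicit$disp, mtcars$disp^2)
```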
However, as stated in the help pages, .data (and .env) is just a "pronoun" (we have verbs, so now we have pronouns too), so it is just a pointer to explain to the tidy internals where the symbol should be resolved. It's just a hint of sorts.
So your statement that both . and .data just mean "our result up to this point in the pipeline" is not correct: . represents the data up to this point, whereas .data is just a declarative hint to the internals.
Consider another way of thinking about .data: let's say we have two functions that completely disambiguate the environment a symbol is referenced against:
- get_internally: this symbol must always reference a column name; it will not reach out to the enclosing environment if the column does not exist; and
- get_externally: this symbol must always reference a variable/object in the enclosing environment; it will never match a column.
In that case, translating the above examples, one might use
disp <- 10
mtcars %>%
  mutate(disp = get_internally(disp) * get_externally(disp))
In that case, it seems more obvious that get_internally is not a frame, so you can't call names(get_internally) and expect it to do something meaningful (other than NULL). It'd be like names(mutate).
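The analogy can be made runnable with a rough base-R sketch. These two helpers are hypothetical: they are not part of dplyr or rlang, and unlike the pseudo-code above they take the data frame and a column name as a string explicitly, which is an assumption made only to keep the example self-contained:

```r
# Hypothetical helpers, written only to illustrate the analogy;
# they are not part of dplyr or rlang.
get_internally <- function(data, name) {
  # Only ever look inside the data frame; fail rather than fall back.
  if (!name %in% names(data)) {
    stop("Column `", name, "` not found in data")
  }
  data[[name]]
}

get_externally <- function(name, env = parent.frame()) {
  # Only ever look in the calling environment (and its parents), never in data.
  get(name, envir = env)
}

disp <- 10
scaled <- get_internally(mtcars, "disp") * get_externally("disp")

head(scaled, 3)
names(get_internally)  # a function has no names, so this is NULL
```

The point of the sketch is only that a lookup mechanism and the data it looks into are different things; asking a mechanism for its names() is not meaningful.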
So don't think of .data as an object; think of it as a mechanism to disambiguate the environment of the symbol. I think the $ it uses is both terse/easy-to-use and absolutely misleading: it is not a list-like or environment-like object, even if it is being treated as such.
BTW: one can write any S3 method for $ that makes any classed-object look like a frame/environment:
`$.quux` <- function(x, nm) paste0("hello, ", nm, "!")
obj <- structure(0, class = "quux")
obj$r2evans
# [1] "hello, r2evans!"
names(obj)
# NULL
(The presence of a $ accessor does not always mean the object is a frame/env.)