A mistake I commonly see from my students is using the assignment operator <- inside dplyr functions. This results in the column name being the assignment call itself.
library(dplyr)
iris |>
summarise(avg_petal_length <- mean(Petal.Length))
#> avg_petal_length <- mean(Petal.Length)
#> 1 3.758
I believe this behavior stems from the use of assignment in base R's dollar-assign notation:
iris$petal_length_one <- iris$Petal.Length + 1
How should I go about explaining this behavior to my students?
A step-by-step account of how this happens, in terms of the non-standard evaluation that dplyr uses, is likely too complex for a beginner R class. A straightforward explanation for an R user who has some knowledge of the basics might be something like this:
Any expressions inside summarise are evaluated to get the value(s) that will be written into the column(s) of the resulting data frame. Typically, these expressions will be passed as named arguments so that we can control column names:
iris |> summarise(a = pi/2)
#> a
#> 1 1.570796
If the expression is passed as an unnamed argument, then summarise will capture the expression, convert it into a string, and use that for the column name. This is in addition to evaluating it for use as a value in the column.
iris |> summarise(pi/2)
#> pi/2
#> 1 1.570796
The reason why we don't just get an error when we use assignment inside summarise is that assignment silently returns the assigned value:
(a <- 32) == 32
#> [1] TRUE
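If it helps to make "silently returns" concrete for students, here is a small base R sketch. The variable names (res, a, x, y) are just for illustration. withVisible() reports both the value an expression returns and whether R would have printed it.

```r
# Assignment is an expression like any other: it returns the assigned
# value, but marks it as invisible so nothing is printed at the console.
res <- withVisible(a <- 32)
res$value    # the value returned by the assignment
#> [1] 32
res$visible  # FALSE means R would not auto-print it
#> [1] FALSE

# This is also why chained assignment works: y <- 5 returns 5,
# which is then assigned to x.
x <- y <- 5
```

Wrapping the assignment in parentheses, as in (a <- 32), is enough to make the returned value visible again.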
So in your example, the expression avg_petal_length <- mean(Petal.Length) is evaluated (using the data mask so that Petal.Length is recognised as a column in the iris data frame) to give the summary value for the column (3.758), but it is also captured to create the name of the column.
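For students who want to see the "capture and evaluate" idea without dplyr, a toy base R function can mimic it. This is only a sketch: capture_and_eval is a made-up name, and it is not how dplyr is actually implemented. It just uses substitute() and deparse() to show that one unnamed argument can supply both a label and a value.

```r
# Hypothetical helper, for illustration only (not dplyr's implementation).
capture_and_eval <- function(expr) {
  label <- deparse(substitute(expr))  # the code as typed, turned into a string
  value <- expr                       # the result of evaluating that code
  setNames(data.frame(value), label)  # one column: captured label, evaluated value
}

plain <- capture_and_eval(pi / 2)
names(plain)
#> [1] "pi/2"

assigned <- capture_and_eval(a <- 32)
names(assigned)
#> [1] "a <- 32"
assigned[[1]]
#> [1] 32
```

The second call behaves like the summarise example: the assignment returns 32, which becomes the column value, while the text of the whole call becomes the column name.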
The learning points here for R beginners are:
Inside summarise, you must use the = operator rather than the <- operator to name a column.
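A side-by-side comparison may be the quickest way to show this in class. The sketch below assumes dplyr is installed and R is version 4.1 or later (for the native pipe). The object names wrong and right are only for illustration.

```r
library(dplyr)

# Using <- : the whole call is passed as an unnamed argument,
# so its text becomes the column name.
wrong <- iris |>
  summarise(avg_petal_length <- mean(Petal.Length))
names(wrong)
#> [1] "avg_petal_length <- mean(Petal.Length)"

# Using = : avg_petal_length is the argument name,
# so it becomes the column name.
right <- iris |>
  summarise(avg_petal_length = mean(Petal.Length))
names(right)
#> [1] "avg_petal_length"
```

Both versions compute the same value (3.758). Only the column name differs.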