
Why is it wrong to use assignment inside a dplyr function in R?


A mistake I commonly see from my students is using the assignment operator <- inside dplyr functions. The result is that the whole assignment call becomes the column name.

library(dplyr)
iris |> 
  summarise(avg_petal_length <- mean(Petal.Length)) 
#>   avg_petal_length <- mean(Petal.Length)
#> 1                                  3.758

I believe this stems from the way assignment is used in base R's dollar-sign notation:

iris$petal_length_one <- iris$Petal.Length + 1

How should I go about explaining this behavior to my students?


Solution

  • A full walk-through of how this happens, in terms of the non-standard evaluation that dplyr uses, is likely too complex for a beginner R class. A straightforward explanation for an R user who knows the basics might go something like this:


    Any expressions inside summarise are evaluated to get the value(s) that will be written into the column(s) of the resulting data frame. Typically, these expressions will be passed as named arguments so that we can control column names:

    iris |> summarise(a = pi/2)
    #>          a
    #> 1 1.570796
    

    If the expression is passed as an unnamed argument, then summarise will capture the expression, convert it into a string, and use that for the column name. This is in addition to evaluating it for use as a value in the column.

    iris |> summarise(pi/2)
    #>       pi/2
    #> 1 1.570796
    

    The reason we don't simply get an error when we use assignment inside summarise is that an assignment is itself an expression that invisibly returns the assigned value:

    (a <- 32) == 32
    #> [1] TRUE
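
To make the "returns the assigned value" point visible in class, a small base R sketch like the following may help. It uses only `withVisible()` from base R; the variable names are arbitrary and chosen purely for illustration.

```r
# An assignment is an expression: it returns the assigned value,
# but marks it invisible so nothing is printed at the console.
res <- withVisible(a <- 32)
res$value    # the value handed back by the assignment
res$visible  # FALSE, which is why the console stays quiet

# Because a value comes back, assignments can be chained...
x <- y <- 5

# ...or passed straight into another function call.
print(b <- 10)
```

The last line is the same pattern as the student mistake: an assignment sitting in an argument position, where the function simply receives the assigned value.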
    

    So in your example, the expression

    avg_petal_length <- mean(Petal.Length)
    

    is evaluated (using the data mask, so that Petal.Length is recognised as a column of the iris data frame) to give the summary value for the column (3.758), but it is also captured to create the name of the column.
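
For students who want to see the "evaluate and also capture" idea without any tidyverse machinery, here is a rough base R sketch. `toy_summarise` is a hypothetical helper written for this explanation, not part of dplyr. dplyr's real implementation is built on rlang and is considerably more involved, so treat this as an analogy that handles a single unnamed argument only.

```r
# Hypothetical one-argument imitation of summarise(), for teaching only
toy_summarise <- function(data, expr) {
  # Capture the argument as an unevaluated expression
  captured <- substitute(expr)

  # Evaluate it with the data frame's columns visible as variables
  value <- eval(captured, data, parent.frame())

  # Turn the captured expression into a string to use as the column name
  label <- paste(deparse(captured), collapse = " ")

  out <- data.frame(value)
  names(out) <- label
  out
}

toy_summarise(iris, mean(Petal.Length))
toy_summarise(iris, avg_petal_length <- mean(Petal.Length))
```

In the second call the assignment is both evaluated (giving the mean) and deparsed (giving the odd column name), which mirrors what the students see with summarise.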

    The learning points here for R beginners are:

    1. Assignment invisibly returns the assigned value.
    2. Tidyverse functions work differently from most R functions due to sophisticated use of non-standard evaluation. This makes many tasks easier, but one needs to learn the syntax rather than applying learning from base R.
    3. If you want to name a new column inside summarise, pass the expression as a named argument with = rather than using the <- operator. Inside a function call, = names an argument; it does not perform an assignment.
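
As a concrete version of the third point, the two spellings can be compared side by side. This sketch assumes dplyr is installed and, like the question, uses the native pipe (R 4.1 or later). The object names `wrong` and `right` are only for illustration.

```r
library(dplyr)

# Unnamed argument containing an assignment:
# the whole call is used as the column name
wrong <- iris |> summarise(avg_petal_length <- mean(Petal.Length))

# Named argument: the name on the left of = becomes the column name
right <- iris |> summarise(avg_petal_length = mean(Petal.Length))

names(wrong)
names(right)
```

Both data frames hold the same summary value; only the column name differs.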