I'm wondering how R evaluates several across() calls in the same summarise() inside a dplyr pipeline. Consider the following example:
library(dplyr)

data(iris)
iris_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      .cols = starts_with("Sepal"),
      .fns = mean
    ),
    across(
      .cols = starts_with("Petal"),
      .fns = ~ .x[which.max(Sepal.Length)]
    )
  )
The result is not the same as the one produced by the following code:
iris_summary_2 <- iris %>%
  group_by(Species) %>%
  summarise(
    across(
      .cols = starts_with("Petal"),
      .fns = ~ .x[which.max(Sepal.Length)]
    ),
    across(
      .cols = starts_with("Sepal"),
      .fns = mean
    )
  )
Is this related to the order in which R evaluates the two across() calls in the same summarise()? See the image below:
I expected R to start again from step 0 (the previous pipe step) before evaluating both step 1 and step 2, but the results seem to indicate that, in step 2, R takes the vector Sepal.Length from step 1 and not from step 0.
Does anyone have tips to force R to take the vector from step 0 without changing the code structure?
Yes, summarize(), like mutate() and tibble(), works sequentially and will use the most recently updated version of any variable.
mtcars |>
  summarize(gear = mean(gear),
            gear2 = mean(gear) * 100)

    gear  gear2
1 3.6875 368.75
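The same effect can be checked directly on iris. The following is only a small sketch; the column names n_before and n_after and the object name check are made up for illustration. It records the length of Sepal.Length before and after the column is overwritten by its mean within the same summarise() call.

```r
library(dplyr)

check <- iris %>%
  group_by(Species) %>%
  summarise(
    n_before = length(Sepal.Length),    # original column: all rows of the group
    Sepal.Length = mean(Sepal.Length),  # overwrites Sepal.Length for later expressions
    n_after = length(Sepal.Length)      # now the single summarised value
  )

check
```

Once Sepal.Length has length 1, which.max(Sepal.Length) can only return 1, so in the first version of the question .x[which.max(Sepal.Length)] simply picks the first Petal value of each group.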
You might consider using the .names argument to put your summary values in new variables that don't alter the original ones.
iris %>%
  group_by(Species) %>%
  summarise(
    across(
      .cols = starts_with("Sepal"),
      .fns = mean,
      .names = "{.col}_mean"
    ),
    across(
      .cols = starts_with("Petal"),
      .fns = ~ .x[which.max(Sepal.Length)],
      .names = "{.col}_max_Sepal"
    )
  )
# A tibble: 3 × 5
  Species    Sepal.Length_mean Sepal.Width_mean Petal.Length_max_Sepal Petal.Width_max_Sepal
  <fct>                  <dbl>            <dbl>                  <dbl>                 <dbl>
1 setosa                  5.01             3.43                    1.2                   0.2
2 versicolor              5.94             2.77                    4.7                   1.4
3 virginica               6.59             2.97                    6.4                   2
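If the original column names have to be kept and the two across() calls should stay in the same order, one possible variation is to give only the means a temporary suffix and strip it afterwards. This is a sketch under assumptions rather than the only way to do it: the "_mean" suffix and the object name iris_summary_3 are arbitrary choices, and it relies on no other column name ending in "_mean".

```r
library(dplyr)

iris_summary_3 <- iris %>%
  group_by(Species) %>%
  summarise(
    # Write the means to temporary names so Sepal.Length itself is not overwritten
    across(
      .cols = starts_with("Sepal"),
      .fns = mean,
      .names = "{.col}_mean"
    ),
    # Sepal.Length still refers to the original per-group vector here
    across(
      .cols = starts_with("Petal"),
      .fns = ~ .x[which.max(Sepal.Length)]
    )
  ) %>%
  # Strip the temporary suffix to restore the original column names
  rename_with(~ sub("_mean$", "", .x), .cols = ends_with("_mean"))

iris_summary_3
```

The other option is the one already shown in the question as iris_summary_2: placing the across() that needs the original Sepal.Length first, so that it runs before the column is overwritten.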