I'm trying to dplyr::summarise
a dataset (collapse) by different summarise_at
/summarise_if
functions so that I have the same named variables in my output dataset. Example:
library(tidyverse)
data(iris)
iris$year <- rep(c(2000,3000),each=25) ## for grouping
iris$color <- rep(c("red","green","blue"),each=50) ## character column
iris$letter <- as.factor(rep(c("A","B","C"),each=50)) ## factor column
head(iris, 3)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species year color letter
1 5.1 3.5 1.4 0.2 setosa 2000 red A
2 4.9 3.0 1.4 0.2 setosa 2000 red A
3 4.7 3.2 1.3 0.2 setosa 2000 red A
The resulting dataset should look like this:
full
Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
1 setosa 2000 87 6.2 5.8 1.9 A red
2 setosa 3000 84.4 6.1 5.5 1.9 A red
3 versicolor 2000 69.4 33.6 7 4.9 B green
4 versicolor 3000 69.1 32.7 6.8 5.1 B green
5 virginica 2000 73.2 51.1 7.7 6.9 C blue
6 virginica 3000 75.5 50.2 7.9 6.4 C blue
I can achieve this by doing the following which is a bit repetitive:
sums <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Width")), list(sum))
max <- iris %>%
group_by(Species, year) %>%
summarise_at(vars(matches("Length")), list(max))
last <- iris %>%
group_by(Species, year) %>%
summarise_if(is.factor, list(last))
first <- iris %>%
group_by(Species, year) %>%
summarise_if(is.character, list(first))
full <- full_join(sums, max) %>% full_join(last) %>% full_join(first)
I have found similar approaches below but can't figure out the approach I've tried here. I would prefer not to make my own function as I think something like this is cleaner by passing everything through a pipe and joining:
test <- iris %>%
#group_by(.vars = vars(Species, year)) %>% #why doesnt this work?
group_by_at(.vars = vars(Species, year)) %>% #doesnt work
{left_join(
summarise_at(., vars(matches("Width")), list(sum)),
summarise_at(., vars(matches("Length")), list(max)),
summarise_if(., is.factor, list(last)),
summarise_if(., is.character, list(first))
)
} #doesnt work
This doesnt work, any suggestions or other approaches?
Helpful: How can I use summarise_at to apply different functions to different columns? Summarize different Columns with different Functions Using dplyr summarize with different operations for multiple columns
By default, the dplyr::left_join()
function only accepts two data frames. If you want to use this function with more than two data frames, you can iterate it with the Reduce
function (base R function):
iris %>%
group_by(Species, year) %>%
{
Reduce(
function(x, y) left_join(x, y),
list(
summarise_at(., vars(matches("Width")), base::sum),
summarise_at(., vars(matches("Length")), base::max),
summarise_if(., is.factor, dplyr::last),
summarise_if(., is.character, dplyr::first)
))
}
# Species year Sepal.Width Petal.Width Sepal.Length Petal.Length letter color
# <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
# 1 setosa 2000 87 6.2 5.8 1.9 A red
# 2 setosa 3000 84.4 6.1 5.5 1.9 A red
# 3 versicolor 2000 69.4 33.6 7 4.9 B green
# 4 versicolor 3000 69.1 32.7 6.8 5.1 B green
# 5 virginica 2000 73.2 51.1 7.7 6.9 C blue
# 6 virginica 3000 75.5 50.2 7.9 6.4 C blue
Furthermore, notice I had to call functions from its package by using ::
in order to avoid name overlapping with previously created data frames.