I am new to R and am trying to collapse rows based on row values with dplyr. The following example shows the sample data.
set.seed(123)
df<-data.frame(A=c(rep(1:4,4)),
B=runif(16,min=0,max=1),
C=rnorm(16, mean=1,sd=0.5))
A B c
1 1 0.36647435 0.7485365
2 2 0.51864614 0.8654337
3 3 0.04596929 0.9858012
4 4 0.15479619 1.1294208
5 1 0.76712372 1.2460700
6 2 0.17666676 0.7402996
7 3 0.89759874 1.2699954
8 4 0.90267735 0.7101804
9 1 0.91744223 0.3451281
10 2 0.25472599 0.8604743
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1 0.21157810 1.3170205
14 2 0.14947268 1.2789700
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853
I want to summarize/collapse two rows based on the condition that the rows in column A
with values 1 and 2 as one row (as mean of row 1 and 2) . Therefore the final result will have only 12 rows because the other 4 rows has been collapsed.
I tried to use the following dplyr
function but to little avail.
install.packages ("tidyverse") library (tidyverse)
df %>% summarize_each( fun(i){ for i %in% c(1,2)funs(mean) })
The expected output is something like:
A B C
1 1.5 0.4425602 0.8069851
3 3 0.04596929 0.9858012
4 4 0.15479619 1.1294208
5 1.5 0.4718952 0.9931848
7 3 0.89759874 1.2699954
8 4 0.90267735 0.7101804
9 1.5 0.5860841 0.6028012
11 3 0.10933985 0.8696796
12 4 0.71656017 1.2648846
13 1.5 0.1805254 1.297995
15 3 0.92251060 1.5696901
16 4 0.30090579 1.7642853
Thank you in advance.
By making the implicit, order based groupings explicit, the summary can
be done with a single summarise_all
call.
# Generate the data
set.seed(1)
df <- data.frame(
A = c(rep(1:4, 4)),
B = runif(16, min = 0, max = 1),
C = rnorm(16, mean = 1, sd = 0.5)
)
library(dplyr)
new <- df %>%
group_by(grp = rep(
1:4, # vector containing names of groups to create
each = 4 # number of elements in each group
)) %>%
group_by(mean_grp = cumsum(A > 2) + 1, add = T) %>%
summarise_all(mean) %>%
ungroup()
new
#> # A tibble: 12 x 5
#> grp mean_grp A B C
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 1 1.5 0.3188163 1.067598241
#> 2 1 2 3.0 0.5728534 1.755890584
#> 3 1 3 4.0 0.9082078 1.194921618
#> 4 2 1 1.5 0.5500358 0.291014883
#> 5 2 2 3.0 0.9446753 1.562465459
#> 6 2 3 4.0 0.6607978 0.977533195
#> 7 3 1 1.5 0.3454502 1.231911487
#> 8 3 2 3.0 0.2059746 1.410610598
#> 9 3 3 4.0 0.1765568 1.296950661
#> 10 4 1 1.5 0.5355633 1.425278418
#> 11 4 2 3.0 0.7698414 1.037282492
#> 12 4 3 4.0 0.4976992 0.005324152
I would recommend keeping the grouping variables in your data after the
summary (everything is simpler if you include them in the first place),
but if you want to, you can drop them with
new %>% select(-grp, -mean_grp)
.
PS. In order to avoid having "magic numbers" (such as the 1:4
and each = 4
when creating grp
) included in the code, you could also create the first grouping variable as:
grp = cumsum(A < lag(A, default = A[1])) + 1
Assuming that the original data are ordered such that a new group starts each time the value of A
is less than the previous value of A
.