I have a large R dataframe of which the first 21 columns are abiotic variables (incl. sample names), and column 22-72 are species with relative abundances as values. Due to processing of the data, each sample (i.e., col 1) has multiple rows for all species (with variable rel. ab. values). The abiotic variables of each of that row (of a sample) are identical.
I would like to sum the relative abundance values of each species per sample.
Below, you can find a example original dataframe and the desired outcome.
Original:
df <- data.frame(
sample = c(1,1,1,1,1,1,1,2,2,2,2,2,2,2,3,3,3,3,3,3),
var1 = c(3,3,3,3,3,3,3,7,7,7,7,7,7,7,2,2,2,2,2,2),
var2 = c(4,4,4,4,4,4,4,42,42,42,42,42,42,42,2,2,2,2,2,2),
species1 = c(0,0,0.05,0,0,0.02,0,0,0,0,0,0,0,0,0,0.001,0.02,0.03,0.001,0),
species2 = c(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.001,0.002,0.03,0,0,0)
)
desired outcome:
df_summed <- data.frame(
sample = c(1, 2, 3),
var1 = c(3, 7, 2),
var2 = c(4, 42, 2),
species1 = c(0.07, 0, 0.052),
species2 = c(0, 0, 0.033)
)
I've tried multiple things with dplyr
functions like group_by
and summarise
. For example:
df_summed <- df %>%
group_by(across(1:21)) %>%
summarise(across(22:ncol(df), sum), .groups = "drop")
but this gives me the error
Caused by error in `across()`:
! Can't subset columns past the end.
ℹ Locations 52, 53, 54, …, 71, and 72 don't exist.
ℹ There are only 51 columns.
while the df does have 72 columns... ncol(df)
yields 72
Could anyone assist me how to perform this operation?
After your group_by
, you need to adjust the summarise
columns accordingly:
For your toy data, the number of columns to group on is 3 (1 + 2 vars). The columns to summarise on (excluding these grouping columns) is now 1:2, not 4:5.
df_summed <- df %>%
group_by(across(1:3)) %>%
summarise(across(1:2, sum), .groups = "drop")
# A tibble: 3 × 5
sample var1 var2 species1 species2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 3 4 0.07 0
2 2 7 42 0 0
3 3 2 2 0.052 0.033
Although Sam's answer advises against using numbers to select columns, this answer explains why yours failed.