I have a dataframe (df1) that includes abundances of different species in each sample:
> SampleID Sp1 Sp2 Sp3 Sp4 ... Spn
> asb-001 3 0 0 23 9
> asb-002 4 15 10 56 98
> asb-003 8 45 8 453 0
> asb-004 0 5 0 3 6
> asb-005 120 56 0 0 0
...
Each column represents a different species.
I have another dataframe (df2)
Sp Fam
Sp1 Fam1
Sp2 Fam2
Sp3 Fam1
Sp4 Fam3
Sp5 Fam2
Sp6 Fam1
...
There are fewer Family names than there are Species names.
I would like to add columns together depending on the species name to get a total for the family (e.g. Sp1 + Sp3 + Sp6 = Fam1). I don't want to keep the original column with the species name.
If all goes according to plan, my new dataframe (df3) will look something like this:
> SampleID Fam1 Fam2 Fam3
> asb-001 12 0 9
> asb-002 14 18 112
> asb-003 28 58 18
> asb-004 10 12 10
> asb-005 142 65 0
...
I could manually go through and add individual species together, but this seems tedious for a large dataset, and likely to produce errors. I have the feeling that I should melt df1, but I'm not confident in the details. Any advice would be appreciated!
Change your original data to be in the long format and join against the species-family mapping. You can then compute summation for each family using group_by
. Finally, spread the data back out to be in wide format.
library( tidyverse )
df1 %>% gather( Sp, Value, -SampleID ) %>% # Convert to long format
inner_join( df2 ) %>% # Combine with family mapping
group_by( SampleID, Fam ) %>% # Work on each sample/family pair
summarize( ValSum = sum(Value) ) %>% # Compute the sum across species
ungroup %>% spread( Fam, ValSum, fill=0 ) # Convert back to wide format
Depending on whether each sample is represented in each family, you may get NA
after converting back to wide format. The optional parameter fill=0
takes care of converting these NA
to 0.