Search code examples
rdataframeplyrmelt

R - adding columns together, depending on column name


I have a dataframe (df1) that includes abundances of different species in each sample:

> SampleID   Sp1   Sp2   Sp3   Sp4   ... Spn
> asb-001      3     0     0    23         9
> asb-002      4    15    10    56        98
> asb-003      8    45     8   453         0
> asb-004      0     5     0     3         6
> asb-005    120    56     0     0         0
...

Each column represents a different species.

I have another dataframe (df2)

Sp     Fam
Sp1   Fam1
Sp2   Fam2
Sp3   Fam1
Sp4   Fam3
Sp5   Fam2
Sp6   Fam1
...

There are fewer Family names than there are Species names.

I would like to add columns together depending on the species name to get a total for the family (e.g. Sp1 + Sp3 + Sp6 = Fam1). I don't want to keep the original column with the species name.

If all goes according to plan, my new dataframe (df3) will look something like this:

> SampleID  Fam1  Fam2  Fam3 
> asb-001     12     0     9 
> asb-002     14    18   112 
> asb-003     28    58    18
> asb-004     10    12    10
> asb-005    142    65     0
...

I could manually go through and add individual species together, but this seems tedious for a large dataset, and likely to produce errors. I have the feeling that I should melt df1, but I'm not confident in the details. Any advice would be appreciated!


Solution

  • Change your original data to be in the long format and join against the species-family mapping. You can then compute summation for each family using group_by. Finally, spread the data back out to be in wide format.

    library( tidyverse )
    
    df1 %>% gather( Sp, Value, -SampleID ) %>%     # Convert to long format
      inner_join( df2 ) %>%                        # Combine with family mapping
      group_by( SampleID, Fam ) %>%                # Work on each sample/family pair
      summarize( ValSum = sum(Value) ) %>%         # Compute the sum across species
      ungroup %>% spread( Fam, ValSum, fill=0 )    # Convert back to wide format
    

    Depending on whether each sample is represented in each family, you may get NA after converting back to wide format. The optional parameter fill=0 takes care of converting these NA to 0.