
Count values shared between groups


Here is some dummy data:

class<-c("ab","ab","ad","ab","ab","ad","ab","ab","ad","ab","ad","ab","av")
otu<-c("ab","ac","ad","ab","ac","ad","ab","ac","ad","ab","ad","ac","av")
value<-c(0,1,12,13,300,1,2,3,4,0,0,2,4)
type<-c("b","c","d","a","b","c","d","d","d","c","b","a","a")
location<-c("b","c","d","a","b","d","d","d","d","c","b","a","a")
datafr1<-data.frame(class,otu,value,type,location)

I want to drop an OTU entirely if any of its replicates within a 'location' and 'type' group has a value of 0, because I am only interested in OTUs shared between all the replicates within a group.

I want to calculate two things: one, the percent abundance of 'value' for all OTUs shared within each 'location' and 'type' group (abundance); and two, the number of OTUs shared in each class (otu.freq).

The caveat is that I would like the OTUs to be classified by 'class' rather than by OTU name (because the OTU name is meaningless).

Expected output:

    class location type abundance otu.freq
       ab        a    a        79        2
       av        a    a        21        1
       ab        b    b       100        1
       ab        c    c       100        1
       ad        d    c       100        1
       ab        d    d        24        2
       ad        d    d        76        2
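
For example, in the 'a'/'a' location/type group nothing needs to be removed, so the group total is 13 + 2 + 4 = 19: class ab contributes 13 + 2 = 15 (about 79%) across 2 OTUs, and class av contributes 4 (about 21%) from 1 OTU.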

I have a much larger data frame and tried the dplyr suggestions here, but I ran out of RAM, so I don't know whether they worked.

The solution provided by @Akron below does not count occurrences where the value is 0, but it does not remove that OTU from the other replicates within the group. If any replicate of an OTU has a value of 0, then the OTU is not shared within that group, and I need to exclude it entirely from both the abundance and the otu.freq calculations.

    library(dplyr)
    so_many_shared3 <- datafr1 %>% 
        group_by(class, location, type) %>% 
        summarise(abundance = sum(value)/sum(datafr1[['value']])*100,
                  otu.freq = sum(value != 0))


   class location type  abundance  otu.freq
1    ab        a    a  4.3859649     2
2    ab        b    b 87.7192982     1
3    ab        c    c  0.2923977     1
4    ab        d    d  1.4619883     2
5    ad        b    b  0.0000000     0
6    ad        d    c  0.2923977     1
7    ad        d    d  4.6783626     2
8    av        a    a  1.1695906     1
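
What I am really after is a filtering step that drops the whole OTU from a 'location'/'type' group whenever any of its replicates is 0, roughly along these lines (a sketch only, since I cannot test it on the full data because of the memory issue):

    library(dplyr)
    datafr1 %>% 
        group_by(location, type, otu) %>% 
        # keep an OTU within a location/type group only if none of its replicates is 0
        filter(all(value != 0))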

Solution

  • You could do this in one step using either data.table

    library(data.table)
    val = sum(datafr1$value)
    setDT(datafr1)[order(class,type), list(abundance = 
                   sum(value)/val*100, otu.freq = .N), 
                   by = .(class, location, type)]
    

    Or using dplyr

    library(dplyr)
    datafr1 %>% 
         group_by(class, location, type) %>% 
         summarise(abundance=sum(value)/sum(datafr1[['value']])*100, otu.freq=n())
     #   class location type  abundance otu.freq
     #1    ab        a    a  4.3859649        2
     #2    ab        b    b 87.7192982        2
     #3    ab        c    c  0.2923977        2
     #4    ab        d    d  1.4619883        2
     #5    ad        b    b  0.0000000        1
     #6    ad        d    c  0.2923977        1
     #7    ad        d    d  4.6783626        2
     #8    av        a    a  1.1695906        1
    

    Update

    Based on the new criteria, I am updating the code suggested by the OP (@K.Brannen)

      datafr1 %>%
           group_by(class, location, type) %>% 
           summarise(abundance=sum(value)/sum(datafr1[['value']])*100, 
                 otu.freq=sum(value !=0)) 
    

    Update2

    Based on the updated expected result

      datafr1 %>%
           filter(value!=0) %>% 
           group_by(location, type) %>% 
           mutate(value1=sum(value)) %>% 
           group_by(class, add=TRUE) %>% 
           summarise(abundance=round(100*sum(value)/unique(value1)), 
                             otu.freq=n())
      #    location type class abundance otu.freq
      #1        a    a    ab        79        2
      #2        a    a    av        21        1
      #3        b    b    ab       100        1
      #4        c    c    ab       100        1
      #5        d    c    ad       100        1
      #6        d    d    ab        24        2
      #7        d    d    ad        76        2
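
    If an OTU also has to be dropped from the remaining replicates of its location/type group whenever any one of them is 0 (the stricter rule in the question), one possible variation is to filter by group before aggregating; this is only a sketch, but on the dummy data it reproduces the expected table above:

      datafr1 %>%
           # drop the whole OTU from a location/type group if any replicate is 0
           group_by(location, type, otu) %>%
           filter(all(value != 0)) %>%
           group_by(location, type) %>%
           mutate(value1=sum(value)) %>%
           group_by(location, type, class) %>%
           summarise(abundance=round(100*sum(value)/unique(value1)),
                             otu.freq=n())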