How to use conditions on the values in all rows of an R data.table


I have been struggling with what I think is a basic task in R, but I am still new to it and couldn't solve it using the forum posts I found.

Here is my dataset:

              Read SNP.mapped DEL.mapped INS.mapped SNP.true DEL.true INS.true Method Population
           1:    0          0          0          0        0        0        0   E_B1          E
           2:    1          0          0          0        0        0        0   E_B1          E
           3:  100          0          0          0        0        0        0   E_B1          E
           4: 1000          0          0          0        0        0        0   E_B1          E
           5: 100B          0          0          0        0        0        0   E_B1          E
    ...
    30657866:  ZZ2          0          0          0        0        0        0   C_N9          C
    30657867:  ZZI          0          0          0        0        0        0   C_N9          C
    30657868:  ZZO          0          0          0        1        0        0   C_N9          C
    30657869:  ZZV          0          0          0        0        0        0   C_N9          C
    30657870:  ZZZ          0          0          0        0        0        0   C_N9          C

Here is an example of what I want to achieve for the first row of my data.table, called "all.dataSNP0":

length(unique(all.dataSNP0$Read[which(all.dataSNP0$Population =="C" & all.dataSNP0$Method =="C_B1")])) / length(unique(all.dataSNP0$Read[which(all.dataSNP0$Population=="C")]))

The result is what I expect, and this works perfectly fine. However, I am now trying to apply this line to every row, but I don't know how to use the actual values of each row's columns in the conditions when I loop through them. I tried this:

all.dataSNP0[, Ratio:=sapply(length(unique(all.dataSNP0$Read[which(Population == .Population & Method == .Method)])) / length(unique(all.dataSNP0$Read[which(Population== .Population)])), "[",1)]

But it doesn't seem to work. I think I am not far off, but I can't find the solution.

Thanks

Eddie


Solution

  • You can use uniqueN to count the number of unique values. Count the unique Read values for each combination of Population and Method, then divide by the number of unique Read values in the whole Population to get the ratio.

    library(data.table)
    # unique reads per (Population, Method)
    all.dataSNP0[, count := uniqueN(Read), by = .(Population, Method)]
    # ratio against the unique reads of the whole Population
    all.dataSNP0[, Ratio := count / uniqueN(Read), by = Population]
    

    The same can be done using dplyr:

    library(dplyr)
    
    all.dataSNP0 %>%
      group_by(Population, Method) %>%
      mutate(count = n_distinct(Read)) %>%
      group_by(Population) %>%
      mutate(Ratio = count / n_distinct(Read)) %>%
      ungroup()
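
To check the grouped approach against the manual expression from the question, here is a minimal sketch on toy data. The table `toy` and its values are made up for illustration and are not the asker's real data. It assumes the ratio wanted is the number of unique reads per (Population, Method) divided by the number of unique reads in that Population.

```r
library(data.table)

# Hypothetical toy data: population C is covered by two methods,
# population E by one. Some reads are repeated on purpose.
toy <- data.table(
  Read       = c("r1", "r2", "r2", "r3", "r1", "r4", "r5", "r5"),
  Method     = c("C_B1", "C_B1", "C_B1", "C_N9", "C_N9", "E_B1", "E_B1", "E_B1"),
  Population = c("C", "C", "C", "C", "C", "E", "E", "E")
)

# Unique reads per (Population, Method), stored on every row of the group.
toy[, count := uniqueN(Read), by = .(Population, Method)]

# Divide by the unique reads of the whole Population.
toy[, Ratio := count / uniqueN(Read), by = Population]

# Manual expression from the question, for Population "C" and Method "C_B1".
manual <- length(unique(toy$Read[toy$Population == "C" & toy$Method == "C_B1"])) /
  length(unique(toy$Read[toy$Population == "C"]))

# One row per group, to compare with `manual`.
unique(toy[, .(Population, Method, Ratio)])
```

In this toy table, population C has three unique reads and method C_B1 sees two of them, so both `manual` and the Ratio column for C_B1 come out as 2/3.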