Search code examples
regexrstringdna-sequence

matching and counting strings (k-mer of DNA) in R


I have a list of strings (DNA sequence) including A,T,C,G. I want to find all matches and insert into table whose columns are all possible combination of those DNA alphabet (4^k; "k" is length of each match - K-mer - and must be specified by user) and rows represent number of matches in sequence in a list.

Lets say my list includes 5 members:

DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")

I want set k=2 (2-mer) so 4^2=16 combination are available including AA,AT,AC,AG,TA,TT,...

So my table will have 5 rows and 16 columns. I want to count number of matches between my k-mers and list members.

My desired result: df:

lstMemb AA AT AC AG TA TT TC ...
  1     2  1  1  0  0  3  0
  2       ...
  3
  4
  5

Could you help me implement this in R?


Solution

  • If you are looking for speed the obvious solution is stringi package. There is stri_count_fixed function for counting patterns. And now, check the code and benchmark!

    DNAlst<-list("CAAACTGATTTT","GATGAAAGTAAAATACCG","ATTATGC","TGGA","CGCGCATCAA")
    dna <- stri_paste(rep(c("A","C","G","T"),each=4),c("A","C","G","T"))
    result <- t(sapply(DNAlst, stri_count_fixed,pattern=dna,overlap=TRUE))
    colnames(result) <- dna
    result
         AA AC AG AT CA CC CG CT GA GC GG GT TA TC TG TT
    [1,]  2  1  0  1  1  0  0  1  1  0  0  0  0  0  1  3
    [2,]  5  1  1  2  0  1  1  0  2  0  0  1  2  0  1  0
    [3,]  0  0  0  2  0  0  0  0  0  1  0  0  1  0  1  1
    [4,]  0  0  0  0  0  0  0  0  1  0  1  0  0  0  1  0
    [5,]  1  0  0  1  2  0  2  0  0  2  0  0  0  1  0  0
    
    
    
    fstri <- function(x){
        t(sapply(x, stri_count_fixed,dna,T))
    }
    fbio <- function(x){
        t(sapply(x, function(x){x1 <-  DNAString(x); oligonucleotideFrequency(x1,2)}))
    }
    
    all(fstri(DNAlst)==fbio(DNAlst)) #results are the same
    [1] TRUE
    
    longDNA <- sample(DNAlst,100,T)
    microbenchmark(fstri(longDNA),fbio(longDNA))
    Unit: microseconds
               expr        min         lq        mean     median         uq        max neval
     fstri(longDNA)    689.378    738.184    825.3014    766.862    793.134   6027.039   100
      fbio(longDNA) 118371.825 125552.401 129543.6585 127245.489 129165.711 359335.294   100
    127245.489/766.862
    ## [1] 165.9301
    

    Ca 165x times faster :)