Search code examples
rsequencestring-comparison

What is a efficient way to encode a string with numbers based on comparison with another string, using a score table?


Let's say we have a score table -

score_matrix = t(data.frame('A' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
                            'T' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
                            'G' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1),
                            'C' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1)))

>
  A T G C
A 1 1 2 2
T 1 1 2 2
G 2 2 1 1
C 2 2 1 1

Now I want to compare multiple strings of equal length...

Query = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
Subject = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))

> Query
[1] "TTATACCAGTGTATGATGAGCCTCG"
> Subject
[1] "GTAGCTCACGAATATATGAACCTCA"

...and match the Subject string to the query and convert it to a sequence of scores based on the matrix above i.e. -

2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2

Code for above comparison -

for (i in 1:length(unlist(strsplit(Query,"")))) { temp = cat(score_mat[unlist(strsplit(Query,""))[i],unlist(strsplit(Subject,""))[i]],"") }

My actual set would be a much larger set in a matrix format for example -

data_matrix = matrix(unlist(strsplit(Query,"")),nrow = 1)
data_matrix = rbind(data_matrix,matrix(unlist(strsplit(Subject,"")),nrow = 1))
for(i in 1:23) {
   data_matrix = rbind(data_matrix,
         matrix(unlist(strsplit(rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T))),"")),
         nrow = 1)) }

> dim(data_matrix)
[1] 25 25

And I can compare the letters individually in a nested loop but it's very inefficient. I tried something like this -

for (i in 2:nrow(data_matrix)) {
   for (j in 1:ncol(data_matrix)) {
      data_matrix[i,j] = score_matrix[data_matrix[i,j],data_matrix[1,j]] }

But for my real data which is around 5000 X 5000 matrices this is extremely slow. For reference her's a benchmark for this matrix 25 X 25. My dataset would take exponentially longer time.

microbenchmark(for (i in 2:nrow(data_matrix)) {
    for (j in 1:ncol(data_matrix)) {
        temp2[i,j] = score_mat[data_matrix[i,j],data_matrix[1,j]]}})

Unit: milliseconds
                                                                                                                                                                                        expr
 <the command above>
      min       lq     mean   median       uq      max neval
 133.0899 159.8918 189.5858 173.7305 208.5522 348.1761   100

What would be a more efficient method to approach this problem?

> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_UK.utf8  LC_CTYPE=English_UK.utf8    LC_MONETARY=English_UK.utf8 LC_NUMERIC=C                   LC_TIME=English_UK.utf8    

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] microbenchmark_1.4.9    stringi_1.7.8      doParallel_1.0.17    iterators_1.0.14   foreach_1.5.2    fs_1.5.2 
 [7] S4Vectors_0.34.0        data.table_1.14.6  forcats_0.5.2        stringr_1.5.0      dplyr_1.0.10     purrr_0.3.5 
[13] readr_2.1.3             tidyr_1.2.1        tibble_3.1.8         ggplot2_3.4.0      tidyverse_1.3.2     

Solution

  • You don't need explicit loops here. You can use Map instead, where each letter in the query string is used as a row index of the score matrix, and each letter in the target string is used as a column index.

    unname(unlist(Map(function(a, b) score_matrix[a, b], 
        a = strsplit(Query, '')[[1]], 
        b = strsplit(Subject, '')[[1]])))
    
    #> [1] 2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2