Let's say we have a score table -
score_matrix = t(data.frame('A' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
'T' = c('A' = 1,'T' = 1,'G' = 2,'C' = 2),
'G' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1),
'C' = c('A' = 2,'T' = 2,'G' = 1,'C' = 1)))
>
A T G C
A 1 1 2 2
T 1 1 2 2
G 2 2 1 1
C 2 2 1 1
Now I want to compare multiple strings of equal length...
Query = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
Subject = rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T)))
> Query
[1] "TTATACCAGTGTATGATGAGCCTCG"
> Subject
[1] "GTAGCTCACGAATATATGAACCTCA"
...and match the Subject string to the query and convert it to a sequence of scores based on the matrix above i.e. -
2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2
Code for above comparison -
for (i in 1:length(unlist(strsplit(Query,"")))) { temp = cat(score_mat[unlist(strsplit(Query,""))[i],unlist(strsplit(Subject,""))[i]],"") }
My actual set would be a much larger set in a matrix format for example -
data_matrix = matrix(unlist(strsplit(Query,"")),nrow = 1)
data_matrix = rbind(data_matrix,matrix(unlist(strsplit(Subject,"")),nrow = 1))
for(i in 1:23) {
data_matrix = rbind(data_matrix,
matrix(unlist(strsplit(rawToChar(as.raw(sample(c(65,67,71,84), 25, replace=T))),"")),
nrow = 1)) }
> dim(data_matrix)
[1] 25 25
And I can compare the letters individually in a nested loop but it's very inefficient. I tried something like this -
for (i in 2:nrow(data_matrix)) {
for (j in 1:ncol(data_matrix)) {
data_matrix[i,j] = score_matrix[data_matrix[i,j],data_matrix[1,j]] }
But for my real data which is around 5000 X 5000 matrices this is extremely slow. For reference her's a benchmark for this matrix 25 X 25. My dataset would take exponentially longer time.
microbenchmark(for (i in 2:nrow(data_matrix)) {
for (j in 1:ncol(data_matrix)) {
temp2[i,j] = score_mat[data_matrix[i,j],data_matrix[1,j]]}})
Unit: milliseconds
expr
<the command above>
min lq mean median uq max neval
133.0899 159.8918 189.5858 173.7305 208.5522 348.1761 100
What would be a more efficient method to approach this problem?
> sessionInfo()
R version 4.2.2 (2022-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)
Matrix products: default
locale:
[1] LC_COLLATE=English_UK.utf8 LC_CTYPE=English_UK.utf8 LC_MONETARY=English_UK.utf8 LC_NUMERIC=C LC_TIME=English_UK.utf8
attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] microbenchmark_1.4.9 stringi_1.7.8 doParallel_1.0.17 iterators_1.0.14 foreach_1.5.2 fs_1.5.2
[7] S4Vectors_0.34.0 data.table_1.14.6 forcats_0.5.2 stringr_1.5.0 dplyr_1.0.10 purrr_0.3.5
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0 tidyverse_1.3.2
You don't need explicit loops here. You can use Map
instead, where each letter in the query string is used as a row index of the score matrix, and each letter in the target string is used as a column index.
unname(unlist(Map(function(a, b) score_matrix[a, b],
a = strsplit(Query, '')[[1]],
b = strsplit(Subject, '')[[1]])))
#> [1] 2 1 1 2 2 2 1 1 1 2 2 1 1 1 2 1 1 1 1 2 1 1 1 1 2