Making a list of all mutations of a sequence (DNA)

I have a DNA sequence, and I want to find all instances of it, or any of its possible mutations in a list of DNA sequence reads. I am using grepl to do this, since it is faster than matchPattern in the instance I am using it. I use parLapply to feed my vector of mutations to the grepl function. But what I am interested in doing is making an easy way of auto-generating my vector of sequence mutations. Originally I typed each mutation, but that leaves room for human error, and if the sequence is lengthened, more mutations would need to be typed. In addition, my current code only allows 1 mutation, and some sequences should allow for more mutations than others. I am not looking for someone to write a loop for me, but just give me a suggestion for accounting for any string.

Right now, I have a semi-automated way of generating the mutations. It now generates the vector without me typing them all out, but only works for 8 nucleotide long sequences. There has to be a better way to generate the vector for any nucleotide sequence of any length.

This is my code:

#My sequence of interest
seq1 <- "GGCGACTG"
lenseq1 <- nchar(seq1)

#A vector of the length of the sequence I wish to create all mutations of
mutsinseq1 <- rep(seq1, 5*lenseq1+4*(lenseq1-1)+1)

#The possible substitutions, insertions, and deletions to the sequence of interest
possnuc <- c("A","T","C","G","")
lenpossnuc <- length(possnuc)

#changing all elements of the vector except for the first
#the first 8 if statements are nucleotide substitutions or deletions
#the other if statements allow for inserts between nucleotides
for(i in 2:length(mutsinseq1)){
  if(i<7){
    mutsinseq1[i] <- paste(possnuc[i-1],substr(seq1,2,lenseq1),sep = "") 
  } else if(i<12){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-6],substr(seq1,3,lenseq1),sep = "")
  } else if(i<17){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-11],substr(seq1,4,lenseq1),sep = "")
  } else if(i<22){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-16],substr(seq1,5,lenseq1),sep = "")
  } else if(i<27){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-21],substr(seq1,6,lenseq1),sep = "")
  } else if(i<32){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-26],substr(seq1,7,lenseq1),sep = "")
  } else if(i<37){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-31],substr(seq1,8,lenseq1),sep = "")
  } else if(i<42){
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-36],sep = "")
  } else if(i<46){
    mutsinseq1[i] <- paste(substr(seq1,1,1),possnuc[i-41],substr(seq1,2,lenseq1),sep = "")
  } else if(i<50){
    mutsinseq1[i] <- paste(substr(seq1,1,2),possnuc[i-45],substr(seq1,3,lenseq1),sep = "")
  } else if(i<54){
    mutsinseq1[i] <- paste(substr(seq1,1,3),possnuc[i-49],substr(seq1,4,lenseq1),sep = "")
  } else if(i<58){
    mutsinseq1[i] <- paste(substr(seq1,1,4),possnuc[i-53],substr(seq1,5,lenseq1),sep = "")
  } else if(i<62){
    mutsinseq1[i] <- paste(substr(seq1,1,5),possnuc[i-57],substr(seq1,6,lenseq1),sep = "")
  } else if(i<66){
    mutsinseq1[i] <- paste(substr(seq1,1,6),possnuc[i-61],substr(seq1,7,lenseq1),sep = "")
  } else{
    mutsinseq1[i] <- paste(substr(seq1,1,7),possnuc[i-65],substr(seq1,8,lenseq1),sep = "")
  }
}

#getting rid of duplicate mutations
mutsinseq1 <- mutsinseq1[-which(duplicated(mutsinseq1))]

The following is what I wish to produce (and is produced by my current code):

mutsinseq1
[1] "GGCGACTG"  "AGCGACTG"  "TGCGACTG"  "CGCGACTG"  "GCGACTG"   "GACGACTG"  "GTCGACTG"  "GCCGACTG"  "GGAGACTG"  "GGTGACTG"  "GGGGACTG"  "GGGACTG"   "GGCAACTG" 
[14] "GGCTACTG"  "GGCCACTG"  "GGCACTG"   "GGCGTCTG"  "GGCGCCTG"  "GGCGGCTG"  "GGCGCTG"   "GGCGAATG"  "GGCGATTG"  "GGCGAGTG"  "GGCGATG"   "GGCGACAG"  "GGCGACCG" 
[27] "GGCGACGG"  "GGCGACG"   "GGCGACTA"  "GGCGACTT"  "GGCGACTC"  "GGCGACT"   "GAGCGACTG" "GTGCGACTG" "GCGCGACTG" "GGGCGACTG" "GGACGACTG" "GGTCGACTG" "GGCCGACTG"
[40] "GGCAGACTG" "GGCTGACTG" "GGCGGACTG" "GGCGAACTG" "GGCGTACTG" "GGCGCACTG" "GGCGATCTG" "GGCGACCTG" "GGCGAGCTG" "GGCGACATG" "GGCGACTTG" "GGCGACGTG" "GGCGACTAG"
[53] "GGCGACTCG" "GGCGACTGG"

How do I solve the problem?

Solution

In other languages, you might do this with a series of nested loops, but in R, there's some nice combinatorics functions. Here's the overall function to do what you want:

library(stringr)
library(purrr)
library(dplyr)

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","_")) {
  l_str <- str_length(string)

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_tibble, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  apply(seq_matrix, 1, paste, collapse = "")
}

I used some packages to make things a little easier on me, but it could all be translated into base R.

Here's example output:

mutate_sequence("ATCG", num = 2)

  [1] "aaCG" "aTaG" "aTCa" "AaaG" "AaCa" "ATaa" "taCG" "tTaG" "tTCa" "AtaG" "AtCa" "ATta" "caCG" "cTaG"
 [15] "cTCa" "AcaG" "AcCa" "ATca" "gaCG" "gTaG" "gTCa" "AgaG" "AgCa" "ATga" "_aCG" "_TaG" "_TCa" "A_aG"
 [29] "A_Ca" "AT_a" "atCG" "aTtG" "aTCt" "AatG" "AaCt" "ATat" "ttCG" "tTtG" "tTCt" "AttG" "AtCt" "ATtt"
 [43] "ctCG" "cTtG" "cTCt" "ActG" "AcCt" "ATct" "gtCG" "gTtG" "gTCt" "AgtG" "AgCt" "ATgt" "_tCG" "_TtG"
 [57] "_TCt" "A_tG" "A_Ct" "AT_t" "acCG" "aTcG" "aTCc" "AacG" "AaCc" "ATac" "tcCG" "tTcG" "tTCc" "AtcG"
 [71] "AtCc" "ATtc" "ccCG" "cTcG" "cTCc" "AccG" "AcCc" "ATcc" "gcCG" "gTcG" "gTCc" "AgcG" "AgCc" "ATgc"
 [85] "_cCG" "_TcG" "_TCc" "A_cG" "A_Cc" "AT_c" "agCG" "aTgG" "aTCg" "AagG" "AaCg" "ATag" "tgCG" "tTgG"
 [99] "tTCg" "AtgG" "AtCg" "ATtg" "cgCG" "cTgG" "cTCg" "AcgG" "AcCg" "ATcg" "ggCG" "gTgG" "gTCg" "AggG"
[113] "AgCg" "ATgg" "_gCG" "_TgG" "_TCg" "A_gG" "A_Cg" "AT_g" "a_CG" "aT_G" "aTC_" "Aa_G" "AaC_" "ATa_"
[127] "t_CG" "tT_G" "tTC_" "At_G" "AtC_" "ATt_" "c_CG" "cT_G" "cTC_" "Ac_G" "AcC_" "ATc_" "g_CG" "gT_G"
[141] "gTC_" "Ag_G" "AgC_" "ATg_" "__CG" "_T_G" "_TC_" "A__G" "A_C_" "AT__"

I made the mutations lowercase or "_" to make it obvious where they are, but you can easily change that to get them back to "normal" sequences.

So each line does some things:

l_str <- str_length(string)

Gets the number of characters in the string.

combn(seq_len(l_str), num, simplify = F)

1) This is all possible combinations of positions along the sequence (indexes), taken num at a time, for the number of mutations.

rerun(num, nucleotides)

2) This repeats your vector of nucleotides num times, and makes it a list. cross(rerun(num, nucleotides)) then gives you every combination from that list as a list, so you're taking every possible combination of nucleotides, with repeats. cross(rerun(num, nucleotides)) %>% map(unlist) collapses the deepest level of the list into vectors.

So those last two chunks give you every possible choice of positions, and then every possible combination of replacements. Then we need every possible combination of those as pairs!

  choices <- cross(list(
    cols = combn(seq_len(l_str), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

For the above output, that means:

[[1]]
[[1]]$`cols`
[1] 1 2

[[1]]$muts
[1] "A" "A"


[[2]]
[[2]]$`cols`
[1] 1 2

[[2]]$muts
[1] "T" "A"
...

So first for positions 1/2, it gives us A/A, T/A, G/A, C/A, _/A, etc. Then each combination again for positions 1/3, then positions 1/4, then 2/3, then 2/4, then 3/4.

So now you have this long list, and let's make it into something nicer. First we make each element into a dataframe with cols and muts, then bind them all into a single one with an identifier for each element called rows:

map_dfr(choices, as_tibble, .id = "rows")

# A tibble: 50 x 3
   rows   cols muts 
   <chr> <int> <chr>
 1 1         1 A    
 2 1         2 A    
 3 2         1 T    
 4 2         2 A    
 5 3         1 C    
 6 3         2 A    
 7 4         1 G    
 8 4         2 A    
 9 5         1 _    
10 5         2 A
# ... with 40 more rows

This gives us a long dataframe. Each of rows is one output string, and the cols tells us which position in the string will be replaces. muts is the characters that will go in those positions. In order to do the subsetting later, we'll then convert rows to numeric, using mutate(...).

seq_matrix <- str_split(rep(string, max(choice_matrix$rows)), "", simplify = T)

Now we take your original string and repeat it as many times as the choice_matrix tells us we'll have mutated sequences. Then we take that vector, and split every one along the character boundaries:

      [,1] [,2] [,3] [,4]
 [1,] "A"  "T"  "C"  "G" 
 [2,] "A"  "T"  "C"  "G" 
 [3,] "A"  "T"  "C"  "G" 
 [4,] "A"  "T"  "C"  "G" 
 [5,] "A"  "T"  "C"  "G" 
 [6,] "A"  "T"  "C"  "G" 
 ...

Now we have a big matrix, and R is fast at operating on these big matrices. We could have done all the other steps with matrix operations, but that seemed like more work than using this list-combination functions.

seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)

This identifies each position based on the rows and cols in the choice_matrix. Then it puts the appropriate value from muts in it. This is also where you can take out str_to_lower to keep them from being lowercase. You'd change the default argument of nucleotides to make the "_" into "".

       [,1] [,2] [,3] [,4]
  [1,] "a"  "a"  "C"  "G" 
  [2,] "a"  "T"  "a"  "G" 
  [3,] "a"  "T"  "C"  "a" 
  [4,] "A"  "a"  "a"  "G" 
  [5,] "A"  "a"  "C"  "a" 
  [6,] "A"  "T"  "a"  "a" 
  ...

So row 1 got "A" and "A" in positions 1 and 2. Then row 2 got "A" and "A" in positions 1 and 3, etc. Now we just have to apply across each row (that's what the 1 in apply(..., 1, ...) does) a function to combine each row into a single string. That would be paste(..., collapse = "").

This will get you huge output quickly. If you do 3 mutations on your original 8 nucleotide sequence, you get an output of 7000. 4 mutations is 43750. And each time gets that much slower, taking about 5s to run the 4 mutations on my desktop. You could precalculate the output length, which is choose(l_str, num) * length(nucleotides)^num.

Updated, again:

To handle insertions as well as deletions, we just need the character matrix to have a slot for every possible insertion. Here's that version:

mutate_sequence <- function(string, num = 1, nucleotides = c("A","T","C","G","")) {
  if (num < 1) {return(string)}

  l_str <- str_length(string)
  l_pos <- (num + 1)*(l_str - 1) + 1

  choices <- cross(list(
    cols = combn(seq_len(l_pos), num, simplify = F),
    muts = cross(rerun(num, nucleotides)) %>% map(unlist)
  ))

  choice_matrix <- 
    map_dfr(choices, as_data_frame, .id = "rows") %>% 
    mutate(rows = as.numeric(rows))

  blanks <- character(l_pos)
  orig_pos <- (seq_len(l_str) - 1) * (num+1) + 1
  blanks[orig_pos] <- str_split(string, "", simplify = T)

  seq_matrix <- matrix(
    rep(blanks, max(choice_matrix$rows)), 
    ncol = l_pos, byrow = T
    )

  seq_matrix[as.matrix(choice_matrix[,1:2])] <- str_to_lower(choice_matrix$muts)
  sequences <- apply(seq_matrix, 1, paste, collapse = "")
  sequences[!duplicated(str_to_upper(sequences))]
}

This does essentially the same as the version of the function above, but first you create a blank vector with enough spots for every insertion. For each original nucleotide, you need an additional spot to insert after it, except the last one. That works out to l_pos <- (num + 1)*(l_str - 1) + 1 positions. character(l_pos) gives you the blanks, and then you fill in the blanks with the original nucleotides at (seq_len(l_str) - 1) * (num+1) + 1.

For example, ATCG with two mutations becomes "A" "" "" "T" "" "" "C" "" "" "G". The rest of the function works the same, just putting every possible nucleotide (or deletion) in every possible spot.

The output before pasteing it all back together then looks like:

     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "a"  "a"  ""   "T"  ""   ""   "C"  ""   ""   "G"  
[2,] "a"  ""   "a"  "T"  ""   ""   "C"  ""   ""   "G"  
[3,] "a"  ""   ""   "a"  ""   ""   "C"  ""   ""   "G"  
[4,] "a"  ""   ""   "T"  "a"  ""   "C"  ""   ""   "G"  
[5,] "a"  ""   ""   "T"  ""   "a"  "C"  ""   ""   "G" 
...

Then after pasteing each row, we can check for repeats with duplicated and exclude those. We could also just get rid of the lowercase mutations and use unique(sequences) instead. Now the output is much shorter than before, the first 55 of 278:

  [1] "aaTCG"  "aaCG"   "aTaCG"  "aTaG"   "aTCaG"  "aTCa"   "AaaTCG" "AaaCG"  "AaTaCG" "AaTaG"  "AaTCaG"
 [12] "AaTCa"  "AaaG"   "AaCaG"  "AaCa"   "ATaaCG" "ATaaG"  "ATaCaG" "ATaCa"  "ATaa"   "ATCaaG" "ATCaa" 
 [23] "taTCG"  "taCG"   "tTaCG"  "tTaG"   "tTCaG"  "tTCa"   "AtaTCG" "AtTaCG" "AtTaG"  "AtTCaG" "AtTCa" 
 [34] "ATta"   "ATCtaG" "ATCta"  "caTCG"  "caCG"   "cTaCG"  "cTaG"   "cTCaG"  "cTCa"   "AcaTCG" "AcaCG" 
 [45] "AcTaCG" "AcTaG"  "AcTCaG" "AcTCa"  "AcaG"   "AcCaG"  "AcCa"   "ATcaCG" "ATcCaG" "ATcCa"  "gaTCG"
...