Search code examples
rbioinformaticsstring-comparisonedit-distance

Calculate edit distance percentage


I am attempting to get a percentage of an edit distance from a group of sequences. So far this is what I have:

library(stringdist)

sequence <- c("CA--------W----------------------EKDRRTEAF---F------",
   "CA--------W----------------------EKDRRTEAF---F------", 
   "CA--------S-------------------SLVFGQGDNIQY---F------", 
   "RA--------S-------------------SLIYSP----LH---F------")

edit_dist <- stringdistmatrix(sequence)
#0 
#13 13 
#14 14 11

len <- stri_length(gsub('-', '', sequence))
#13    13    16    12

As each line of len is equivalent to each line of sequence, when comparing the two lines I would like to use the largest len to get the percentage. So as when having an edit distance between the second and third sequence it would use the length of 16 rather than 13 to get a percentage.

I know this code is wrong, but it is generally the idea I am going for:

for (i in len) {
  num1 <- len[i]
  for (j in len){
    num2 <- len[j] 
    if (num2 > num1){
        num <- num2
        }else{
          num <- num1
        }
    }
    edit_dist/num
}

The answer should look something similar to that below:

0
.8125  .8125
1.0769  1.0769  .6875

Solution

  • You can construct a suitable matrix of the max length with outer and pmax, which you can then coerce to dist class (like edit_dist) so you can divide:

    edit_dist <- stringdistmatrix(sequence)
    n <- nchar(gsub('-', '', sequence))
    
    edit_dist / as.dist(outer(n, n, pmax))
    ##          1        2        3
    ## 2 0.000000                  
    ## 3 0.812500 0.812500         
    ## 4 1.076923 1.076923 0.687500