
Function to evaluate identity between character strings


I have two character vectors describing the same objects, produced by two different annotation programs. I need to make sure that the annotations agree, but the descriptions are not necessarily worded the same way. I expect to do most of the work manually, but I wonder whether there is an R function that can calculate, for example, how many words each pair of values from the two vectors has in common, or generate some sort of identity score, so that I can at least order the pairs by similarity. Below is a small example of the dataset:

    Annotation <- data.frame(
      Annotation.A = c(
        "PREDICTED: similar to endonuclease domain containing 1 Coiled-coil domain-containing protein 58",
        "G protein pathway suppressor 2",
        "adducin 3a"
      ),
      Annotation.B = c(
        "PREDICTED: endonuclease domain-containing 1 protein-like [Xiphophorus maculatus] coiled-coil domain-containing protein 58 [Salmo salar]",
        "PREDICTED: G protein pathway suppressor 2-like [Takifugu rubripes]",
        "PREDICTED: gamma-adducin-like isoform X7 [Maylandia zebra]"
      )
    )

Any help would be appreciated! Thank you


Solution

  • This defines a mismatch score for the two elements in each row of Annotation (the number of words that appear in only one of the two strings, divided by the number of words they share) and applies it to every row:

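    a <- Annotation
    # coerce each column to character and remove trailing spaces
    ch <- replace(a, TRUE, lapply(a, function(col) sub(" *$", "", as.character(col))))
    w <- lapply(ch, strsplit, " ") # split each string into words
    
    # words found in only one string, divided by words found in both;
    # the result is Inf when the two strings share no words
    mismatch <- function(x, y)
      (length(setdiff(x, y)) + length(setdiff(y, x))) / length(intersect(x, y))
    
    # calculate mismatch score for each row of Annotation
    mismatches <- sapply(seq_len(nrow(a)), function(i) mismatch(w[[1]][[i]], w[[2]][[i]]))
    
    cutoff <- 2 # might need to change this
    ok <- mismatches < cutoff # TRUE where the two annotations probably agree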
    

    Also try using just the numerator of mismatch(), i.e. the number of words that appear in only one of the two strings, to see if that is a better measure.
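A minimal sketch of the numerator-only variant, which also orders the rows by score as the question asks. The helper names split_words and n_unshared and the column name unshared are made up for illustration; the data is the example from the question, and only base R functions are used.

```r
# Example data copied from the question
Annotation <- data.frame(
  Annotation.A = c(
    "PREDICTED: similar to endonuclease domain containing 1 Coiled-coil domain-containing protein 58",
    "G protein pathway suppressor 2",
    "adducin 3a"
  ),
  Annotation.B = c(
    "PREDICTED: endonuclease domain-containing 1 protein-like [Xiphophorus maculatus] coiled-coil domain-containing protein 58 [Salmo salar]",
    "PREDICTED: G protein pathway suppressor 2-like [Takifugu rubripes]",
    "PREDICTED: gamma-adducin-like isoform X7 [Maylandia zebra]"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop trailing spaces, then split one string on spaces
split_words <- function(s) strsplit(sub(" *$", "", s), " ")[[1]]

# Numerator of mismatch(): number of words found in only one of the two strings
n_unshared <- function(x, y) length(setdiff(x, y)) + length(setdiff(y, x))

# One count per row of Annotation
unshared <- mapply(
  function(s, t) n_unshared(split_words(s), split_words(t)),
  Annotation$Annotation.A, Annotation$Annotation.B,
  USE.NAMES = FALSE
)

# Order rows from fewest to most unshared words
ranked <- cbind(Annotation, unshared = unshared)[order(unshared), ]
```

On the three example rows the counts are 11, 5 and 8, so the adducin row, whose two descriptions share no words at all, ranks above the endonuclease row. The raw count penalizes long descriptions, which is worth keeping in mind when comparing it with the ratio used in mismatch().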