I have two character vectors describing the same objects, produced by two different annotation programs. I need to check that the annotations agree, but the descriptions are not necessarily worded in the same way. I expect to do most of the work manually, but I wonder if there is an R function that can calculate, for example, how many words each pair of values has in common, or generate some sort of identity score, so that I can at least order the rows by similarity. Below is a small example of the dataset:
Annotation <- data.frame(Annotation.A = c("PREDICTED: similar to endonuclease domain containing 1 Coiled-coil domain-containing protein 58", "G protein pathway suppressor 2", "adducin 3a"), Annotation.B = c("PREDICTED: endonuclease domain-containing 1 protein-like [Xiphophorus maculatus] coiled-coil domain-containing protein 58 [Salmo salar]", "PREDICTED: G protein pathway suppressor 2-like [Takifugu rubripes]", "PREDICTED: gamma-adducin-like isoform X7 [Maylandia zebra]" ))
Any help would be appreciated! Thank you
This defines a mismatch score between the two elements of each row of Annotation
and applies it, giving a score for each row (lower means more similar; Inf means no words in common):
a <- Annotation
ch <- replace(a, TRUE, lapply(a, sub, pattern = " *$", replacement = "")) # remove trailing spaces
w <- lapply(ch, strsplit, " ") # split into words
# words found in only one description, relative to the number of shared words
mismatch <- function(x, y)
  (length(setdiff(x, y)) + length(setdiff(y, x))) / length(intersect(x, y))
# calculate mismatch score for each row of Annotation
mismatches <- sapply(seq_len(nrow(a)), function(i) mismatch(w[[1]][[i]], w[[2]][[i]]))
cutoff <- 2 # might need to change this
ok <- mismatches < cutoff
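Since the goal was to order the rows by similarity, the score can also be stored as a column and used for sorting. The following is only a sketch on a two-row toy data frame: the names `toy`, `words` and `toy_sorted` are made up for illustration, and `mismatch()` is repeated so that the snippet runs on its own.

```r
# Toy stand-in for Annotation (two rows taken from the example in the question)
toy <- data.frame(
  A = c("adducin 3a", "G protein pathway suppressor 2"),
  B = c("gamma-adducin-like isoform X7", "G protein pathway suppressor 2-like"),
  stringsAsFactors = FALSE
)

# Split every description into words, one list of word vectors per column
words <- lapply(toy, strsplit, " ")

# Same score as above: unshared words relative to shared words
mismatch <- function(x, y)
  (length(setdiff(x, y)) + length(setdiff(y, x))) / length(intersect(x, y))

# Score each row, then sort so the best matches come first
toy$score <- mapply(mismatch, words$A, words$B)
toy_sorted <- toy[order(toy$score), ]
toy_sorted
```

Rows with no shared words get a score of Inf and end up at the bottom, which is probably where you want them when checking manually.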
Also try using just the numerator of mismatch(), i.e. the number of words that
appear in only one of the two descriptions, to see if that is a better measure.
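As a rough sketch of that variation, the snippet below computes the numerator-only count next to a Jaccard similarity (shared words divided by all distinct words). Jaccard is a swapped-in alternative, not part of the score above, but it stays between 0 and 1 and avoids Inf. The helper names `tokens`, `n_unshared` and `jaccard` are hypothetical, and lower-casing the text and dropping the bracketed species tags are assumptions about what counts as noise in this data.

```r
# Split a description into distinct lower-case words, dropping bracketed
# species tags such as "[Salmo salar]" (an assumption, adjust as needed)
tokens <- function(s) {
  s <- tolower(gsub("\\[[^]]*\\]", " ", s))
  unique(strsplit(trimws(s), "[[:space:]]+")[[1]])
}

# Numerator of mismatch(): number of words found in only one description
n_unshared <- function(x, y) length(setdiff(x, y)) + length(setdiff(y, x))

# Jaccard similarity: shared words / all distinct words
jaccard <- function(x, y) length(intersect(x, y)) / length(union(x, y))

x <- tokens("G protein pathway suppressor 2")
y <- tokens("PREDICTED: G protein pathway suppressor 2-like [Takifugu rubripes]")

n_unshared(x, y)  # 3: "2", "predicted:" and "2-like"
jaccard(x, y)     # 4/7, about 0.57
```

Whether "2" and "2-like" should count as the same word is a judgment call; stripping suffixes like "-like" before comparing would raise the similarity for pairs such as this one.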