I have a character vector with the names of drugs that are already registered, and another with the names of new drugs. I want to know whether each new drug resembles an already registered one.
For example, if supercure is a drug that can be produced by either firm1 or firm2, and supercure firm1 1000mg and supercure firm2 500mg are already registered, then supercure firm1 500mg should be associated with both of them.
agrep does this kind of approximate matching in R, and sapply applies it to every drug in the new list:
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")
res <- unlist(sapply(new, agrep, x = registered))
res
As expected, supercure gets two matches, randomcure one match and unknowncure no match (which is what I want). However, sapply appears to have altered the names so that there are no duplicates: supercure firm1 500mg became supercure firm1 500mg1 and supercure firm1 500mg2:
supercure firm1 500mg1 supercure firm1 500mg2 randomcure firm2 1000mg
1 2 3
This is a problem because it prevents me from selecting the matched drugs from the new list:
new[new %in% names(res)]
only catches randomcure (because supercure's name has been altered).
I can think of ways to fix this with some rather clumsy text processing, but is there a cleaner way to get the list of new drugs that found a match?
The ideal output would be:
supercure firm1 500mg supercure firm1 500mg randomcure firm2 1000mg
1 2 3
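sapply didn't alter the names; unlist did. When a named list element holds more than one value, unlist appends a sequence number to its name to tell the values apart. This gives the desired output: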
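# one element per new drug, named after that drug
x <- sapply(new, agrep, x = registered)
# flatten, then repeat each name once per match instead of letting unlist number them
setNames(unlist(x), rep(names(x), lengths(x)))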
# supercure firm1 500mg supercure firm1 500mg randomcure firm2 1000mg
# 1 2 3
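If the goal is only the list of new drugs that found a match, the names do not have to go through unlist at all. Here is a minimal sketch, assuming the new and registered vectors from the question. The names matches, matched_new and pairs are illustrative, and simplify = FALSE is my own addition so that the result stays a named list even when every drug happens to get the same number of matches:

```r
new <- c("supercure firm1 500mg", "randomcure firm2 1000mg", "unknowncure firm2 100mg")
registered <- c("supercure firm1 1000mg", "supercure firm2 500mg", "randomcure firm1 1000mg")

# Toy illustration of the renaming: unlist numbers the values of a
# multi-value named element, and leaves single-value names alone.
toy <- unlist(list(a = 1:2, b = 3L))
names(toy)  # "a1" "a2" "b"

# simplify = FALSE (an assumption, not in the original code) always returns
# a list with one element per new drug, named after that drug.
matches <- sapply(new, agrep, x = registered, simplify = FALSE)

# New drugs with at least one approximate match among the registered ones.
matched_new <- new[lengths(matches) > 0]

# Long format: one row per (new drug, registered drug) pair.
pairs <- data.frame(
  new = rep(names(matches), lengths(matches)),
  registered = registered[unlist(matches, use.names = FALSE)],
  stringsAsFactors = FALSE
)
```

matched_new answers the "which new drugs found a match" question directly, and pairs keeps every match on its own row, so duplicated names never come up.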