I am writing some bigger ML script to detect synonyms and abbreviations in database. One of the metrics of similarity of words is how many of first letters in two strings match. So i have 2 vectors:
v1 <- c("rejtan", "reiki","rejon")
v2 <- c("rejtan", "rejtan", "beiki")
and I want to have this result (% of letters from the beggining of a word that match):
rejtan reiki rejon
rejtan 1 0.3333333 0.5
rejtan 1 0.3333333 0.5
beiki 0 0.0000000 0.0
I came up with this function:
count.first.character.matches <- function(vec1,vec2) {
sapply(X = vec1 , FUN= function(x) {
sapply(X = vec2, FUN = function(y) {
ny <- nchar(y)
nx <- nchar(x)
shorter_length <- ifelse(nx > ny, nx, ny)
matches <- sum(sapply( 1:shorter_length, FUN=function(i,x,y) { substr(x,1,i) == substr(y,1,i)}, x,y ))
matches / shorter_length
})
})
My question is: How one can improve performance of this function? I have set of 65K of vector pairs, each 700-1K words, so I end up with calculating this metric a lot, and according to Rprof this takes approx. 25% of time.
Using your approach as is, there are some things you could change to make it more efficient.
1) nchar
is a function that, unlike length
, has to compute the number of characters of its arguments and not get an attribute. You re computing nchar
for "v2" for every "v1" but, also, nchar
for "v1" for every "v2". You could put nchar(x)
outside of the second sapply
or, even better, utilize the vectorised nature of nchar
and compute everything once and before any sapply
. In particular, having
x = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = ""))
y = replicate(1e3, paste0(sample(letters, sample(4:10, 1), TRUE), collapse = ""))
instead of
system.time({
nx = nchar(x)
ny = nchar(y)
})
#user system elapsed
# 0 0 0
you use
system.time({
sapply(x, function(X)
sapply(y, function(Y) {
nX = nchar(X)
nY = nchar(Y)
}))
})
#user system elapsed
#8.08 0.00 8.27
2)substring
is vectorised, so the third sapply
can be avoided. (Also, in checking single characters of a string, strsplit
might be faster and, being itself vectorised, can be computed outside of any loop.)
3)A block of if else
is faster than ifelse
when comparing 'length == 1' vectors. That is completely minor of course, but after two nested sapply
s it adds extra computational time without needing to:
microbenchmark::microbenchmark(replicate(1e4, if(2 < 3 && 5 > 3) 1 else 0),
replicate(1e4, ifelse(2 < 3 && 5 > 3, 1, 0)))
#Unit: milliseconds
# expr min lq median uq max neval
# replicate(10000, if (2 < 3 && 5 > 3) 1 else 0) 14.22543 14.85759 15.09545 15.78781 56.84884 100
# replicate(10000, ifelse(2 < 3 && 5 > 3, 1, 0)) 29.77642 31.44824 36.20305 37.85782 65.72473 100
So, having these in mind:
OP2 = function(v1, v2)
{
nc1 = nchar(v1)
nc2 = nchar(v2)
sv2 = seq_along(v2)
sapply(seq_along(v1),
function(i) {
sapply(sv2,
function(j) {
len = if(nc1[[i]] > nc2[[j]]) nc1[[i]] else nc2[[j]]
ind = seq_len(len)
sum(substring(v1[[i]], 1, ind) == substring(v2[[j]], 1, ind)) / len
})
})
}
And compare with yours:
set.seed(007)
v1b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = ""))
v2b = replicate(1e2, paste0(sample(letters[1:5], sample(4:10, 1), TRUE), collapse = ""))
sum(count.first.character.matches(v1b, v2b) != OP2(v1b, v2b))
#[1] 0
microbenchmark::microbenchmark(count.first.character.matches(v1b, v2b), OP2(v1b, v2b), times = 20)
#Unit: milliseconds
expr min lq median uq max neval
# count.first.character.matches(v1b, v2b) 932.2840 949.3697 969.6321 985.2237 1081.2882 20
# OP2(v1b, v2b) 161.7503 185.1102 192.3019 197.5060 213.6818 20
Another idea, besides your approach, could be (after changing "OP2" to keep the shortest length):
ff = function(x, y)
{
sx = strsplit(x, "", fixed = TRUE)
sy = strsplit(y, "", fixed = TRUE)
array(mapply(function(X, Y) {
slen = seq_len(min(length(X), length(Y)))
wh = X[slen] == Y[slen]
if(all(wh)) return(1) else (which.min(wh) - 1) / length(slen)
},
rep(sx, each = length(sy)), sy),
c(length(x), length(y)), list(y, x))
}
sum(ff(v1b, v2b) != OP2(v1b, v2b))
#[1] 0
microbenchmark::microbenchmark(ff(v1b, v2b), OP2(v1b, v2b), times = 20)
#Unit: milliseconds
# expr min lq median uq max neval
# ff(v1b, v2b) 72.72661 80.43703 85.85113 89.16066 110.5722 20
# OP2(v1b, v2b) 165.13991 168.15051 176.01596 182.11389 213.9557 20