I have a corpus V of TF/IDF vectors, so they are pretty sparse. It's an array of about 2,500 by 150,000. I want to calculate the cosine similarity between each pair of documents in the corpus.
This is almost the most naive way I can think of to do it. I know of three or four optimizations already, but I don't want to assume the answer. I'd like to know the most computationally efficient way to use Chapel for this calculation. The goal is to get X as a symmetric matrix with diag(X) = 0.
use Norm, LinearAlgebra;

var ndocs = 2500,
    nftrs = 150000,
    docs = 1..ndocs,
    ftrs = 1..nftrs,
    V: [docs, ftrs] real,
    X: [docs, docs] real;

for i in docs {
  var n1 = norm(V[i, ..]);
  for j in (i+1)..ndocs {
    var n2 = norm(V[j, ..]);
    var c = dot(V[i, ..], V[j, ..]) / (n1 * n2);
    X[i, j] = c;
    X[j, i] = c;
  }
}
Compiled using
chpl -I/usr/local/Cellar/openblas/0.2.20/include -L/usr/local/Cellar/openblas/0.2.20/lib -lblas cosim.chpl
== UPDATED ==
This code should actually compile and run. The original code had errors, as pointed out by @bradcray below.
Here are some improvements that can be made to the original implementation:

- Hoist `dot(V[i, ..], V[i, ..])` for all `i` into an array to reduce repeated computations.
- Use `1..V.size` or `V.domain` instead of `1..V.shape[1]` — `V.shape` is computed from the domain sizes, rather than stored as a field.
- Compute `X` in parallel.

For more details see the GitHub issue that explores these changes and their impact on the timings.