Tags: sparse-matrix, cosine-similarity, chapel

Efficient construction of cosine similarity matrix from corpus in Chapel


I have a corpus V of TF/IDF vectors, so they are pretty sparse.
It's an array about 2,500 by 150,000.
I want to calculate the cosine similarity between each document in the corpus.

This is almost the most naive way I can think of to do it. I know of three or four optimizations already, but I don't want to assume the answer. I'd like to know the most computationally efficient way to use Chapel for this calculation. The goal is to get X as a symmetric matrix with diag(X) = 0.

use Norm,
    LinearAlgebra;

var ndocs = 2500,
    nftrs = 150000,
    docs = 1..ndocs,
    ftrs = 1..nftrs,
    V: [docs, ftrs] real,   // corpus of TF/IDF vectors
    X: [docs, docs] real;   // pairwise cosine similarities

// Fill the upper triangle and mirror it, so X is symmetric with diag(X) = 0.
for i in docs {
  var n1 = norm(V[i,..]);
  for j in (i+1)..ndocs {
    var n2 = norm(V[j,..]);
    var c = dot(V[i,..], V[j,..]) / (n1*n2);
    X[i,j] = c;
    X[j,i] = c;
  }
}

Compiled using

chpl -I/usr/local/Cellar/openblas/0.2.20/include -L/usr/local/Cellar/openblas/0.2.20/lib -lblas cosim.chpl

== UPDATED ==

This code should now compile and run. The original code had errors, as pointed out by @bradcray below.


Solution

Here are some improvements that can be made to the original implementation:

  • Pre-compute and cache dot(V[i, ..], V[i, ..]) for all i in an array, so that it is not recomputed on every pass through the inner loop.
  • Use 1..V.size or V.domain instead of 1..V.shape[1], since V.shape is computed from the domain sizes rather than stored as a field.
  • Exploit the embarrassingly parallel nature of this computation by building X in parallel; see the sketch below.

For more details, see the GitHub issue that explores these changes and their impact on the timings.
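Putting those suggestions together, a minimal sketch might look like the following. It is illustrative only, not code from the answer: it reuses the declarations and modules from the question, the `norms` array name is made up, and it caches row norms (rather than the squared dot products, which is equivalent here).

use Norm,
    LinearAlgebra;

var ndocs = 2500,
    nftrs = 150000,
    docs = 1..ndocs,
    ftrs = 1..nftrs,
    V: [docs, ftrs] real,
    X: [docs, docs] real;

// Cache each row's norm once instead of recomputing it inside the inner loop.
var norms: [docs] real;
forall i in docs do
  norms[i] = norm(V[i, ..]);

// Iteration i only writes row i and column i above/below the diagonal, so the
// writes of different iterations never overlap and the outer loop can be a forall.
forall i in docs {
  for j in (i+1)..ndocs {
    const c = dot(V[i, ..], V[j, ..]) / (norms[i] * norms[j]);
    X[i, j] = c;
    X[j, i] = c;
  }
}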