Search code examples
cmatrix3deuclidean-distancevdsp

Vector Accelerated Euclidean Distance in 3D


I am in need of performing a very common and simple matrix operation.
However I need it fast, really fast...

I am already considering a multi-threaded implementation, however for now I just want to see how fast I can get it on a single processor.

The matrix operation is as follows:

I am calculating the euclidean distance between a vector of points (A) and a reference point (B).
The points are in 3D space and each point has a set of X, Y and Z coordinates.
Consequently the vector of points is described by three floating-point arrays holding the X, Y, Z coordinates for each point.
The output is another vector of length N holding the distances between each point in the array and the reference point.

The three XYZ arrays are arranged as columns of a Nx3 matrix.

x[0]      y[0]      z[0]
x[1]      y[1]      z[1]
x[2]      y[2]      z[2]
x[3]      y[3]      z[3]
.         .         .
.         .         .
.         .         .
x[N-1]    y[N-1]    z[N-1]

In memory the matrix is arranged in row-major order as a 1D array containing the values of the X, Y and Z columns sequentially.

x[0], x[1], x[2], x[3] . . . x[N-1], y[0], y[1], y[2], y[3] . . . y[N-1], z[0], z[1], z[2], z[3] . . . z[N-1]

The whole thing is slightly complicated by the fact that we need to add a scalar to each member of the matrix before we take the square root.

The following is the routine in naive C code:

void calculateDistances3D(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
{
    float *Ax = matrix;
    float *Ay = Ax + N;
    float *Az = Ay + N;
    int i;

    for (i = 0; i < N; i++) {

        float dx = Ax[i] - Bx;
        float dy = Ay[i] - By;
        float dz = Az[i] - Bz;

        float dx2 = dx * dx;
        float dy2 = dy * dy;
        float dz2 = dz * dz;

        float squaredDistance = dx2 + dy2 + dz2;
        float squaredDistancePlusScalar = squaredDistance + scalar;

        distances[i] = sqrt(squaredDistancePlusScalar);
    }
}

…and here is the naive Accelerate implementation (using vDSP and VecLib):
(notice that all processing is performed in-place)

void calculateDistances3D_vDSP(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
{
    float *Ax = matrix;
    float *Ay = Ax + N;
    float *Az = Ay + N;

    // for each point in the array take the difference with the reference point
    Bx = -Bx;
    By = -By;
    Bz = -Bz;
    vDSP_vsadd(Ax, 1, &Bx, Ax, 1, N);
    vDSP_vsadd(Ay, 1, &By, Ay, 1, N);
    vDSP_vsadd(Az, 1, &Bz, Az, 1, N);

    // square each coordinate
    vDSP_vsq(Ax, 1, Ax, 1, N);
    vDSP_vsq(Ay, 1, Ay, 1, N);
    vDSP_vsq(Az, 1, Az, 1, N);

    // reduce XYZ columns to a single column in Ax (reduction by summation)
    vDSP_vadd(Ax, 1, Ay, 1, Ax, 1, N);
    vDSP_vadd(Ax, 1, Az, 1, Ax, 1, N);

    // add scalar
    vDSP_vsadd(Ax, 1, &scalar, Ax, 1, N);

    // take sqrt
    vvsqrtf(distances, Ax, &N);
}

In the vDSP library the only functions that could be used to calculate distances between vectors are:

vDSP_vdist()
vDSP_distancesq()
vDSP_vpythg()

Maybe I am missing something but as far as I can tell none of them support three input vectors as needed to calculate a distance in 3D.

A couple of things to notice:
(1) I am not comparing distances so I cannot live with the square distance. I need the real distance hence calculating the square root is absolutely necessary.
(2) Taking the reciprocal square root could be a possibility if you really think that doing so will make the code significantly faster.

I have the impression that I am not using the Accelerate framework to its full potential.
I am looking for something a little smarter and maybe concise, doing more work in less function calls. Rearranging the memory in other ways could also be ok, however I think that the memory layout is pretty good as it is.

I am also open to suggestions about other highly optimized/vectorized linear algebra libraries that work on Intel processors. I don’t care if they are commercial or open-source solutions as long as their performance is fast and robust.

The questions is: what is the best function or combination of functions in the Accelerate framework to achieve faster code than the above?

I am developing in Xcode 7 on a MacBook Pro (Retina, 15-inch, Mid 2014) running Mac OS X El Capitan.

Thanks.


Solution

  • Try this one.

    • it performs around 20% better for large N = 2^20 @ 1000 repeats on both my iMac and iPhone
    • in addition, matrix can be treated as strictly read-only, because only distances is manipulated
    • it’s algebraically equivalent but not numerically equivalent to your implementation; expect the output difference to be around 10^-6

    In my opinion, vDSP is too high level for further "targeted" optimization. Instead, you may look into Ray Wenderlich’s iOS Assembly Tutorial as a starting point for using NEON and write your own SIMD instructions for this specific problem.

    Depending on the size N of your problem, you could also gain a further speed-up by using the GPU, for example with Metal.

    void calculateDistances3D_vDSP(float *matrix, float Bx, float By, float Bz, float scalar, float *distances, int N)
    {
      float *Ax = matrix;
      float *Ay = Ax + N;
      float *Az = Ay + N;
    
      float constants = Bx*Bx + By*By + Bz*Bz + scalar;
    
      Bx = -2.0f*Bx;
      By = -2.0f*By;
      Bz = -2.0f*Bz;
    
      vDSP_vsq(Ax, 1, distances, 1, N);                      // Ax^2
      vDSP_vma(Ay, 1, Ay, 1, distances, 1, distances, 1, N); // Ax^2 + Ay^2
      vDSP_vma(Az, 1, Az, 1, distances, 1, distances, 1, N); // Ax^2 + Ay^2 + Az^2
      vDSP_vsma(Ax, 1, &Bx, distances, 1, distances, 1, N);  // Ax^2 + Ay^2 + Az^2 - 2*Bx
      vDSP_vsma(Ay, 1, &By, distances, 1, distances, 1, N);  // Ax^2 + Ay^2 + Az^2 - 2*Bx - 2*By
      vDSP_vsma(Az, 1, &Bz, distances, 1, distances, 1, N);  // Ax^2 + Ay^2 + Az^2 - 2*Bx - 2*By - 2*Bz
      vDSP_vsadd(distances, 1, &constants, distances, 1, N); // ... + constants = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2 + scalar
    
      /*
      vDSP_vsq(Ax, 1, distances, 1, N);                      // Ax^2
      vDSP_vsma(Ax, 1, &Bx, distances, 1, distances, 1, N);  // Ax^2 - 2*Ax*Bx
      vDSP_vma(Ay, 1, Ay, 1, distances, 1, distances, 1, N); // Ax^2 - 2*Ax*Bx + Ay^2
      vDSP_vsma(Ay, 1, &By, distances, 1, distances, 1, N);  // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By
      vDSP_vma(Az, 1, Az, 1, distances, 1, distances, 1, N); // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By + Az^2
      vDSP_vsma(Az, 1, &Bz, distances, 1, distances, 1, N);  // Ax^2 - 2*Ax*Bx + Ay^2 - 2*Ay*By + Az^2 - 2*Az*Bz
      vDSP_vsadd(distances, 1, &constants, distances, 1, N); // ... + constants = (Ax-Bx)^2 + (Ay-By)^2 + (Az-Bz)^2 + scalar
       */
    
      // take sqrt
      vvsqrtf(distances, distances, &N);
    }