Tags: c++, arm, signal-processing, cortex-m

Fastest way of resizing (rescaling) a 1D vector by an arbitrary factor


I have the following code that resizes a 1D vector with nearest-neighbor interpolation, in a similar fashion to how you'd resize an image. Another term would be resampling, but there seems to be a lot of confusion around these terms (resampling is also a technique in statistics), so I prefer to be more descriptive.

Currently the code looks like this and I need to optimize it:

inline void resizeNearestNeighbor(const int16_t* current, uint32_t currentSize, int16_t* out, uint32_t newSize, uint32_t offset = 0u)
{
    if(currentSize == newSize)
    {
        return;
    }

    const float scaleFactor = static_cast<float>(currentSize) / static_cast<float>(newSize);
    for(uint32_t outIdx = 0; outIdx<newSize; ++outIdx)
    {
        const uint32_t currentIdx = static_cast<uint32_t>(outIdx * scaleFactor);
        out[outIdx] = current[(currentIdx + offset)%currentSize];
    }
}

This of course is not hugely efficient, because the float-to-integer conversion in the inner loop is expensive, and I don't think the loop can benefit from vectorization in this form. The platform is a Cortex-M7, so if you're familiar with any vectorization techniques on this platform, that would also be very helpful.

The use case of this code is a sound effect that allows smoothly changing the length of a delay line (hence the additional offset parameter, since it's a ring buffer). Smoothly changing the length of a delay line sounds like slowing down or speeding up playback on a tape recorder, only in a loop. Without this scaling there are lots of clicks and artifacts. Currently the hardware struggles with all the DSP plus this code on top, and it can't rescale long delay lines in real time.


Solution

  • Since the Cortex-M series is quite limited (even floating point is optional on the M7), a reasonable speed-up can be expected from Bresenham's midpoint line-drawing algorithm.

    This algorithm always advances either N or N+1 elements based on the sign of the error term. The modulus does not need a full-length division: it suffices to compute currentIdx += N + (delta < 0); if (currentIdx >= currentSize) currentIdx -= currentSize;

    One can also add a "trial division" of the form if (currentIdx + 64 * (N+1) < currentSize) to ensure that the next 64 elements need no modular reduction. The M7 has a fast multiplier, but when the batch size is a power of two, computing the bound with a shift instead of a multiplication is likely still a worthwhile micro-optimisation.
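    As a sketch of how the conditional subtraction and the batched trial check might combine (the function name, the batch size of 64, and the exact error-term bookkeeping are my own illustration, not code from the answer):

    ```cpp
    #include <cstdint>

    // Sketch: each output sample advances y by either N or N+1, so if
    // y + batch*(N+1) stays below currentSize, the whole batch can skip
    // the wrap-around check. Assumes newSize > 0.
    inline void resampleBresenhamBatched(const int16_t* in, uint32_t currentSize,
                                         int16_t* out, uint32_t newSize,
                                         uint32_t offset = 0u)
    {
        const uint32_t N   = currentSize / newSize;  // whole-sample step
        const uint32_t rem = currentSize % newSize;  // reduced remainder
        int32_t D = 2 * static_cast<int32_t>(rem) - static_cast<int32_t>(newSize);
        uint32_t y = offset % currentSize;

        uint32_t x = 0;
        while (x < newSize)
        {
            uint32_t n = newSize - x;
            if (n > 64) n = 64;
            if (y + n * (N + 1) < currentSize)
            {
                // Fast path: no wrap possible within this batch.
                for (uint32_t i = 0; i < n; ++i)
                {
                    out[x + i] = in[y];
                    y += N + (D > 0);
                    if (D > 0) D -= 2 * static_cast<int32_t>(newSize);
                    D += 2 * static_cast<int32_t>(rem);
                }
            }
            else
            {
                // Slow path: conditional subtraction replaces the modulus.
                for (uint32_t i = 0; i < n; ++i)
                {
                    out[x + i] = in[y];
                    y += N + (D > 0);
                    if (D > 0) D -= 2 * static_cast<int32_t>(newSize);
                    D += 2 * static_cast<int32_t>(rem);
                    if (y >= currentSize) y -= currentSize;
                }
            }
            x += n;
        }
    }
    ```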

    Bresenham's line-drawing algorithm has the form

    plotLine(x0, y0, x1, y1)
       dx = x1 - x0
       dy = y1 - y0
       D = 2*dy - dx
       y = y0
    
       for x from x0 to x1
           plot(x,y)
           if D > 0
               y = y + 1
               D = D - 2*dx
           end if
           D = D + 2*dy
    

    Your application does not have x0, x1, y0, y1; instead it directly has dy = input_size, dx = output_size.

    resample(dx, dy)             // dx = output_size, dy = input_size
       inSize = dy               // keep the unreduced input size for wrapping
       N = dy / dx
       dy = dy % dx
       D = 2*dy - dx
       y = offset
       for x from 0 to dx-1
           out[x] = in[y]
           y = y + N
           if D > 0
               y = y + 1
               D = D - 2*dx
           end if
           D = D + 2*dy
           if y >= inSize
               y = y - inSize
           end if
    

    The crucial modification needed to advance y by N > 0 steps per iteration is the reduction dy = dy % dx, which keeps the error computation correct.
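    A direct C++ translation of this pseudocode into the question's signature might look as follows (the function name is mine; dx corresponds to newSize and the unreduced dy to currentSize):

    ```cpp
    #include <cstdint>

    // Bresenham-style nearest-neighbor resampler over a ring buffer.
    // Assumes newSize > 0. Uses conditional subtraction instead of %.
    inline void resampleBresenham(const int16_t* in, uint32_t currentSize,
                                  int16_t* out, uint32_t newSize,
                                  uint32_t offset = 0u)
    {
        const uint32_t N  = currentSize / newSize;  // whole-sample step
        const uint32_t dy = currentSize % newSize;  // reduced remainder
        int32_t D  = 2 * static_cast<int32_t>(dy) - static_cast<int32_t>(newSize);
        uint32_t y = offset % currentSize;

        for (uint32_t x = 0; x < newSize; ++x)
        {
            out[x] = in[y];
            y += N;
            if (D > 0)
            {
                ++y;                                 // extra step this sample
                D -= 2 * static_cast<int32_t>(newSize);
            }
            D += 2 * static_cast<int32_t>(dy);
            if (y >= currentSize)                    // wrap around the ring
                y -= currentSize;
        }
    }
    ```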

    One can also use a slightly less accurate fixed-point DDA. Note that the step must be currentSize / newSize (input samples consumed per output sample) in 16.16 format, and that currentSize must stay below 65536 for the shifted values to fit in 32 bits:

    uint32_t scale = (uint32_t)(((uint64_t)currentSize << 16) / newSize);
    uint32_t y = offset << 16;
    for (uint32_t x = 0; x < newSize; x++) {
        out[x] = in[y >> 16];
        y += scale;
        if (y >= ((uint32_t)currentSize << 16))
            y -= (uint32_t)currentSize << 16;
    }