audio accelerate-framework vdsp

Accelerate framework used, no observable speedup


I have the following piece of audio code that I thought would be a good candidate for using vDSP in the Accelerate framework.

// --- get pointers for buffer lists
float* left = (float*)audio->mBuffers[0].mData;
float* right = numChans == 2 ? (float*)audio->mBuffers[1].mData : NULL;

float dLeftAccum = 0.0;
float dRightAccum = 0.0;

float fMix = 0.25; // -12dB HR per note

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate and scale
        dLeftAccum += fMix*(float)dLeft;
        dRightAccum += fMix*(float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}

// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

So I modified it to use vDSP, applying the fMix multiply once at the end of the block of frames.

// --- the frame processing loop
for(UInt32 frame=0; frame<inNumberFrames; ++frame)
{
    // --- zero out for each trip through loop
    dLeftAccum = 0.0;
    dRightAccum = 0.0;
    float dLeft = 0.0;
    float dRight = 0.0;

    // --- synthesize and accumulate each note's sample
    for(int i=0; i<MAX_VOICES; i++)
    {
        // --- render
        if(m_pVoiceArray[i]) 
            m_pVoiceArray[i]->doVoice(dLeft, dRight);

        // --- accumulate and scale
        dLeftAccum += (float)dLeft;
        dRightAccum += (float)dRight;

    }

    // --- accumulate in output buffers
    // --- mono
    left[frame] = (float)dLeftAccum;

    // --- stereo
    if(right) right[frame] = (float)dRightAccum;
}
// --- scale the whole block once; right is NULL for mono, so guard it
vDSP_vsmul(left, 1, &fMix, left, 1, inNumberFrames);
if(right) vDSP_vsmul(right, 1, &fMix, right, 1, inNumberFrames);
// needed???
//  mAbsoluteSampleFrame += inNumberFrames;

return noErr;

However, my CPU usage still remains the same. I see no perceptible benefit of using vDSP here. Am I doing this correctly? Many thanks.

Still new to vector operations, go easy on me :)

If there are some obvious optimizations that I should be doing (outside of the Accelerate framework), feel free to point them out to me, thanks!


Solution

  • Your vector call is performing 2 multiplies per sample frame (one per channel) at audio sample rates. Even if your sample rate were 192 kHz, you'd only be talking about 384,000 multiplies per second, which is not really enough to register on a modern CPU. Moreover, you're just moving existing multiplies to another place. If you had a look at the generated assembly, I would guess that the compiler optimized your original code pretty decently, and any speedup in the vDSP call is going to be offset by the fact that you now require a second pass over the buffer.

    Another important thing to note is that all of the vDSP functions are going to work better when the vector data is aligned on a 16-byte boundary. If you take a look at the SSE2 instruction set (which I'm sure vDSP uses heavily), you'll see that many instructions have one version for aligned data and another for unaligned data.

    The way you would align data in gcc is something like this:

    float inVector[8] __attribute__ ((aligned(16))) = {1, 2, 3, 4, 5, 6, 7, 8};
    

    Or, if you're allocating on the heap, check whether an aligned allocator such as posix_memalign (or C11's aligned_alloc) is available.
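
For the heap case, here is a minimal sketch assuming a POSIX system where posix_memalign is available. The helper names (alloc_aligned_floats, is_aligned_16, scale_in_place) are made up for illustration, and the plain loop only stands in for a scalar-times-vector call such as vDSP_vsmul so the sketch builds without Accelerate.

```c
#define _POSIX_C_SOURCE 200112L  /* expose posix_memalign under strict C modes */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: allocate `count` floats on a 16-byte boundary.
   Returns NULL on failure; release the result with free(). */
static float *alloc_aligned_floats(size_t count)
{
    void *p = NULL;
    /* The alignment must be a power of two and a multiple of sizeof(void *). */
    if (posix_memalign(&p, 16, count * sizeof(float)) != 0)
        return NULL;
    return (float *)p;
}

/* Hypothetical helper: nonzero if `p` sits on a 16-byte boundary. */
static int is_aligned_16(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Plain-C stand-in for an in-place scalar-times-vector pass.
   This is NOT the vDSP implementation, just the same arithmetic. */
static void scale_in_place(float *buf, float gain, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        buf[i] *= gain;
}
```

Checking is_aligned_16 on the buffers you actually hand to vDSP (for example the mData pointers in the question) is a cheap way to find out whether alignment is even a factor before changing any allocation code.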