Tags: audio, real-time, pcm

What is the smallest (manageable) number of samples I can give to a PCM buffer?


Some APIs, like this one, can create a PCM buffer from an array of samples (each represented by a number).

Say I want to generate and play some audio in (near) real time. I could generate a PCM buffer with 100 samples and send them off to the sound card using my magic API functions. While those 100 samples are playing, 100 more samples are generated, and then the buffers are switched. Finally, I can repeat the writing / playing / switching process to create a constant stream of audio.
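
Something like this sketch in C, where pcm_submit() and pcm_wait_done() are made-up stand-ins for the real API's "queue a buffer for playback" and "block until it has finished playing" calls:

```c
#include <math.h>
#include <stdint.h>

#define BUF_SAMPLES 100          /* samples per buffer - the size in question */
#define SAMPLE_RATE 44100
#define TWO_PI      6.283185307179586

/* Hypothetical stand-ins for whatever the real audio API provides. */
extern void pcm_submit(const int16_t *buf, int nsamples);  /* queue for playback  */
extern void pcm_wait_done(void);                           /* block until drained */

/* Fill a buffer with the next chunk of a 440 Hz sine wave. */
static void generate(int16_t *buf, int nsamples, double *phase)
{
    for (int i = 0; i < nsamples; i++) {
        buf[i] = (int16_t)(32767.0 * sin(*phase));
        *phase += TWO_PI * 440.0 / SAMPLE_RATE;
    }
}

void stream(void)
{
    int16_t bufs[2][BUF_SAMPLES];
    double phase = 0.0;
    int cur = 0;

    generate(bufs[cur], BUF_SAMPLES, &phase);        /* prime the first buffer */
    for (;;) {
        pcm_submit(bufs[cur], BUF_SAMPLES);          /* start it playing       */
        cur ^= 1;                                    /* switch buffers         */
        generate(bufs[cur], BUF_SAMPLES, &phase);    /* fill the idle buffer   */
        pcm_wait_done();                             /* wait for the other one */
    }
}
```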

Now, for my question: what is the smallest buffer size (in samples) I can use with the write / play / switch approach without a perceivable pause occurring in the audio stream? I understand the answer will depend on sample rate, processor speed, and transfer time to the sound card - so please provide a "rule of thumb" answer if that's more appropriate!

(I'm a bit new to audio stuff, so please feel free to point out any misconceptions I might have!)


Solution

  • TL;DR: 1ms buffers are easily achievable on desktop operating systems if care is taken, though they may not be desirable from a performance and energy-usage perspective.
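
    To put that in the question's terms of sample counts: a buffer's duration is simply its sample count divided by the sample rate, so a 1ms buffer is only a few dozen samples at common rates. The arithmetic, assuming a mono stream:

    ```c
    #include <stdio.h>

    int main(void)
    {
        const double rates[] = { 44100.0, 48000.0, 96000.0 };
        const double buffer_ms = 1.0;

        /* samples needed = sample_rate * buffer_duration */
        for (int i = 0; i < (int)(sizeof rates / sizeof rates[0]); i++)
            printf("%6.0f Hz: %.1f ms buffer = %3.0f samples\n",
                   rates[i], buffer_ms, rates[i] * buffer_ms / 1000.0);
        return 0;
    }
    /* prints (rounded): 44100 Hz -> 44, 48000 Hz -> 48, 96000 Hz -> 96 samples */
    ```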

    The lower limit on buffer size (and thus on output latency) is set by the worst-case scheduling latency of your operating system.

    The sequence of events is:

    1. The audio hardware progressively outputs samples from its buffer
    2. At some point, it reaches a low-water-mark and generates an interrupt, signalling that the buffer needs replenishing with more samples
    3. The operating system services the interrupt and marks the thread as ready to run
    4. The operating system schedules the thread to run on a CPU
    5. The thread computes, or otherwise obtains, samples and writes them into the output buffer (see the sketch after this list).
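
    A rough sketch of the thread's side of that sequence, with a timestamp taken at each wakeup so you can observe the scheduling jitter directly. pcm_wait_buffer_ready() is a hypothetical stand-in for whatever blocking "wait until the device wants more samples" call your API provides, and the fprintf() is for illustration only - you would not do I/O inside a real render loop:

    ```c
    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical stand-in: blocks until the driver signals that the output
     * buffer has hit its low-water mark (steps 2-4 above). */
    extern void pcm_wait_buffer_ready(void);

    static double us_between(const struct timespec *a, const struct timespec *b)
    {
        return (b->tv_sec - a->tv_sec) * 1e6 + (b->tv_nsec - a->tv_nsec) / 1e3;
    }

    void render_loop(int16_t *out, int nsamples)
    {
        struct timespec prev, now;
        int have_prev = 0;

        for (;;) {
            pcm_wait_buffer_ready();               /* steps 2-4: wait, get scheduled */
            clock_gettime(CLOCK_MONOTONIC, &now);  /* now running on a CPU           */

            /* Wakeup-to-wakeup jitter is a direct view of scheduling latency. */
            if (have_prev)
                fprintf(stderr, "wakeup interval: %.1f us\n", us_between(&prev, &now));
            prev = now;
            have_prev = 1;

            /* Step 5: compute samples and write them into the output buffer. */
            for (int i = 0; i < nsamples; i++)
                out[i] = 0;                        /* placeholder: silence */
        }
    }
    ```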

    The scheduling latency is the time between steps 2 and 4 above, and is dictated largely by the design of the host operating system. With a hard RTOS such as VxWorks or eCos using pre-emptive priority scheduling, the worst case can be on the order of fractions of a microsecond.

    General-purpose desktop operating systems are generally less slick. Mac OS X supports real-time user-space scheduling and is easily capable of servicing 1ms buffers. The Linux kernel can be configured for pre-emptive real-time threads, with bottom-half interrupt handlers run in kernel threads; you ought to be able to achieve 1ms buffer sizes there too. I can't comment on the capabilities of recent versions of the NT kernel.
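
    On Linux, for example, moving the render thread into the real-time scheduling class is a single call. A sketch - the priority value 80 is an arbitrary example, and the call normally needs root or a suitable RLIMIT_RTPRIO setting to succeed:

    ```c
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    /* Request SCHED_FIFO so the calling (audio) thread pre-empts ordinary
     * threads as soon as it becomes runnable. */
    static int make_realtime(void)
    {
        struct sched_param param;
        memset(&param, 0, sizeof(param));
        param.sched_priority = 80;   /* arbitrary example priority */

        int err = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
        if (err != 0)
            fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
        return err;
    }
    ```

    Call it from the render thread itself, before entering the audio loop.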

    It's also possible to take a (usually large) latency hit in step 5 if your process takes a page fault while filling the buffer. The usual practice is to allocate all of the heap and stack memory you will need up front and mlock() it, along with your program code and data, into physical memory.
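
    A sketch of that locking step using the standard mlockall() call. MCL_CURRENT | MCL_FUTURE locks everything currently mapped plus anything mapped later; touching a block of stack up front, as below, is the usual extra step so later stack growth cannot fault (the 256 KiB figure is an arbitrary example):

    ```c
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #define STACK_PREFAULT_BYTES (256 * 1024)   /* arbitrary example figure */

    static void lock_memory(void)
    {
        /* Lock all current and future pages (code, data, heap, stacks) into
         * RAM so they can never be paged out under the real-time path. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");

        /* Touch a chunk of stack now so those pages are mapped and locked
         * before the audio thread starts.  (An aggressive optimiser may
         * elide this; writing through a volatile pointer avoids that.) */
        unsigned char dummy[STACK_PREFAULT_BYTES];
        memset(dummy, 0, sizeof(dummy));
    }
    ```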

    Absolutely forget about achieving low latency in an interpreted or JITed language run-time. You have far too little control over what the language run-time is doing, and no realistic prospect of preventing page faults (e.g. from memory allocation). I suspect 10ms is pushing your luck in these cases.

    It's worth noting that rendering short buffers has a significant impact on system performance (and energy consumption) due to the high rate of interrupts and context switches. These destroy L1 cache locality in a way that's disproportionate to the work they actually do.