Bypassing audio API channel limits

Chances are, while messing around in a game such as Minecraft, or with an audio API or the audio system of a game engine or application, you've hit a point where playing more than a certain number of audio clips becomes erratic. The usual cause is a hard limit on the number of clips, sources or channels that the particular audio API will render at any given point in time, commonly 32 channels, though sometimes far more or far fewer.

Assuming you’re a programmer and have been using an audio API directly, you should have something equivalent to this in your game or application;
[Diagram: audio-mixing]

And of course, if we have more entities, machines, pipes, etc. than the audio API can handle, audio will start to play erratically. However, assuming you have access to the raw audio in the sound clips, and you can push (or have the API pull) raw audio to the output device, we can implement our own mixing system that allows a virtually unlimited amount of audio to play at the same time!

What we want to do is mix the audio sources together using our own mixing function/s and send the resulting raw data to the API, something like this;
[Diagram: audio-mixing-2]
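
The exact hook-up depends on the API, but with a pull-style API it usually boils down to a single render callback. Here's a minimal sketch of that glue, where OnAudioNeeded and SetRenderCallback are made-up names for illustration (not any real API), and RenderAudio is the mixer we'll build at the end of this post;

// hypothetical glue: the API calls this whenever it wants more audio
void OnAudioNeeded(float* deviceBuffer, int samples)
{
  RenderAudio(deviceBuffer, samples);
}

// at start-up, register it with the (made-up) API:
// audioDevice.SetRenderCallback(OnAudioNeeded);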

In this case we need two mix functions: one that does a standard stereo-to-stereo mix and one that does mono-to-stereo. Alternatively you could use just the one mix function and run a pan function on each effect first, but that is less efficient, as it requires extra memory for the temporary buffer/s and extra CPU time from the additional pass over the data.
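
For comparison, the pan-then-mix route would look something like this per effect, where Pan1to2 is a hypothetical pan function and the buffers are purely illustrative; you pay for the temporary stereo buffer and an extra pass:

// less efficient: pan into a temporary stereo buffer first,
// then reuse the stereo-to-stereo mix below
Pan1to2(effectBuff, tempStereoBuff, effect.m_pan, samples);
Mix2to2(tempStereoBuff, destBuffer, effect.m_volume, samples);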

Implementing these mix functions in code is actually extremely simple;

void Mix2to2(const float* sourceBuffer, float* destBuffer,
 float volume, int samples)
{
  // buffers are interleaved stereo, so 'samples' frames = samples*2 floats
  for(int i = 0; i < (samples * 2); i += 2)
  {
    destBuffer[i] += sourceBuffer[i] * volume;     // left
    destBuffer[i+1] += sourceBuffer[i+1] * volume; // right
  }
}

As we can see, we're looping over the buffers two samples at a time and mixing into the destination buffer. The reason it's laid out like this is that audio buffers are conventionally handled as a single one-dimensional float array with the channels interleaved, each channel's sample appearing every x values. So for stereo audio, each buffer is a float array of sample values in the pattern left-right-left-right-etc.
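
As a quick worked example of that layout (the values are made up purely for illustration);

// 4 frames of interleaved stereo: L0 R0 L1 R1 L2 R2 L3 R3
float src[8]  = { 0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f };
float dest[8] = { 0.f };
Mix2to2(src, dest, 0.5f, 4); // 4 frames; dest now holds src at half volume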

For the mono-to-stereo mixer, it’s quite similar;

// clamps a value to the 0..1 range
float clamp01(float v)
{
  return v < 0.f ? 0.f : (v > 1.f ? 1.f : v);
}

void Mix1to2(const float* sourceBuffer, float* destBuffer,
 float volume, float pan, int samples)
{
  float leftVol = clamp01(1.f - pan);
  float rightVol = clamp01(1.f + pan);
  for(int i = 0; i < samples; ++i)
  {
    float sourceVal = sourceBuffer[i] * volume;
    destBuffer[(i*2)] += sourceVal * leftVol;    // left
    destBuffer[(i*2)+1] += sourceVal * rightVol; // right
  }
}

The difference is that we calculate separate left and right volumes from a given pan value (-1 for all left, 1 for all right, 0 for both equal), and we index each buffer differently, as the source is mono and the destination stereo.
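
For instance, panning a mono effect hard left (assuming monoBuff and stereoBuff are suitably sized buffers);

// pan = -1: leftVol = clamp01(1 - -1) = 1, rightVol = clamp01(1 + -1) = 0
Mix1to2(monoBuff, stereoBuff, 1.f, -1.f, samples);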

If you don’t already have a sampler function available, here’s a basic one;

// returns the number of samples successfully output
int SampleClip(float* destBuffer, int samples)
{
  // the sample offset into the clip at the current play time
  int startSample = (int)(m_clipSampleRate * m_playTime);
  for(int i = 0; i < samples; ++i)
  {
    int clipSample = startSample + i;

    if(m_loop)
      clipSample = clipSample % m_clipTotalSamples;
    else if(clipSample >= m_clipTotalSamples)
      return i; // ran past the end of the clip
    destBuffer[i] = m_clipSamples[clipSample];
  }
  return samples;
}

This is without any form of format conversion, however, so for now the clip will need to use the same format as we output. This function should also live in a clip class or struct with the required variables and the clip sample array (labelled m_… for convenience).
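
One way that clip type could look, as a sketch with assumed field types matching the m_ names used above;

struct AudioClip
{
  float* m_clipSamples;      // raw float samples (mono, or interleaved stereo)
  int    m_clipTotalSamples; // length of m_clipSamples
  int    m_clipSampleRate;   // e.g. 48000
  float  m_playTime;         // current play position in seconds
  bool   m_loop;             // wrap around at the end of the clip?

  int SampleClip(float* destBuffer, int samples);
};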

Now to put this all together, like in the diagram;

void SetAllZero(float* buff, int totSamples)
{
  for(int i = 0; i < totSamples; ++i)
    buff[i] = 0.f;
}
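
In C or C++ a plain memset does the same job, since a float whose bits are all zero is 0.0f;

#include <cstring>

// equivalent to SetAllZero(buff, totSamples)
memset(buff, 0, totSamples * sizeof(float));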

void RenderAudio(float* destBuffer, int samples)
{
  // 10ms at 48kHz; should work on most systems
  const int maxSamples = 480;
  // make sure we can't over-run!
  samples = min(samples, maxSamples);
  // this can be skipped if the API guarantees the
  //   buffer is already zero-ed
  SetAllZero(destBuffer, samples*2);
  

  // sample the music
  float musicBuff[maxSamples*2];
  SetAllZero(musicBuff, maxSamples*2);
  music.m_clip.SampleClip(musicBuff, samples*2);

  // mix it in
  Mix2to2(musicBuff, destBuffer, 
    music.m_volume, samples);

  // sample ambience
  float ambienceBuff[maxSamples*2];
  SetAllZero(ambienceBuff, maxSamples*2);
  ambience.m_clip.SampleClip(ambienceBuff, samples*2);

  Mix2to2(ambienceBuff, destBuffer,
    ambience.m_volume, samples);

  // sample and mix effects
  float effectsBuff[maxSamples*2];
  SetAllZero(effectsBuff, maxSamples*2);
  
  float tempEffBuff[maxSamples];
  for(AudioSource& effect : effects)
  {
    SetAllZero(tempEffBuff, maxSamples);
    effect.m_clip.SampleClip(tempEffBuff, samples);

    // mix it in
    Mix1to2(tempEffBuff, effectsBuff, effect.m_volume,
        effect.m_pan, samples);
  }

  Mix2to2(effectsBuff, destBuffer,
    effectsVolume, samples);
}
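
One thing this render function doesn't do is advance each clip's play position, so somewhere (here, or in an update tick) m_playTime needs to move forward, otherwise every callback renders the same block again. A minimal sketch, assuming the same 48kHz rate; note that a stereo clip sampled with samples*2 floats consumes twice as many clip samples, so advance it accordingly depending on how you define m_clipSampleRate:

// advance each mono effect's play position by the block just rendered
for(AudioSource& effect : effects)
  effect.m_clip.m_playTime += (float)samples / 48000.f;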

And there we go. The specific implementation will vary with the language in use, but this should serve as a basic example of how to implement custom audio mixing. A more professional system would organise audio groups into some form of tree, be able to re-sample audio clips to match the mixing format, and possibly use caching for better performance. This example also relies on the API accepting stereo 32-bit float at 48kHz; most APIs can convert that to the current system format by default, but otherwise you'll need to convert the format yourself before sending it off.
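
If you do end up needing that conversion yourself, going from float to signed 16-bit PCM (a common device format) is just a clamp and a scale;

#include <cstdint>

// converts float samples in the -1..1 range to signed 16-bit PCM
void FloatToS16(const float* in, int16_t* out, int totSamples)
{
  for(int i = 0; i < totSamples; ++i)
  {
    float v = in[i];
    if(v > 1.f) v = 1.f;   // clamp to avoid overflow when scaling
    if(v < -1.f) v = -1.f;
    out[i] = (int16_t)(v * 32767.f);
  }
}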

That concludes this basic tutorial. I hope you gained something from it, and please leave feedback via this survey;
https://www.surveymonkey.com/r/BMNFRMW
Thanks!  :}