Is it possible to mix two mono audio tensors of different length (number of frames) in torchaudio?

I have two byte arrays, one from the mic and one from the soundcard, of the same duration (15 seconds). They have different formats (sample rate of mic = 44100, n_frames = 1363712; sample rate of stereo = 48000, n_frames = 1484160). I had assumed resampling would help (16k desired), but they still have different lengths and can't simply be combined (added; I am assuming that adding the tensors results in mixed audio).

I can't see a built-in method for mixing audio, but perhaps I'm overlooking something. I see that sox_effects is included, but none of the effects listed seem relevant, although I know sox can mix audio.

Am I barking up the wrong tree with torchaudio?

Solution

Mixing audio is simply taking the sum or the average of the source waveforms, so TorchAudio does not provide a specialized method for it. Users are expected to do the operation with plain PyTorch tensor operations.
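As a minimal sketch of that idea, assuming both waveforms are float tensors of shape (channels, frames) that already share the same sample rate and length: the helper name mix_equal_length is hypothetical and not part of TorchAudio. Averaging rather than summing keeps samples inside the original range, which helps avoid clipping when the inputs are normalized to [-1, 1].

```python
import torch


def mix_equal_length(a: torch.Tensor, b: torch.Tensor, average: bool = True) -> torch.Tensor:
    """Mix two waveforms of identical shape (hypothetical helper, not a TorchAudio API)."""
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    mixed = a + b  # element-wise sum is the mix
    # Halving the sum gives the average, which stays within the input range.
    return mixed / 2 if average else mixed


# Tiny illustrative waveforms, shape (1 channel, 2 frames).
x = torch.tensor([[1.0, 0.0]])
y = torch.tensor([[0.0, 1.0]])
mixed_avg = mix_equal_length(x, y)
mixed_sum = mix_equal_length(x, y, average=False)
```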

The remaining problem is how to handle the different lengths, i.e. how to make the two waveforms the same length.

You can cut the long one to align it to the short one, or zero-pad the short one to align it to the long one.
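Both options can be sketched with plain PyTorch, assuming the waveforms are shaped (channels, frames) and have already been resampled to a common rate (for example with torchaudio.functional.resample). The function name align_lengths and its mode argument are made up for illustration. Zero-padding is done at the end of the shorter waveform, which assumes both recordings start at the same moment.

```python
import torch
import torch.nn.functional as F


def align_lengths(a: torch.Tensor, b: torch.Tensor, mode: str = "pad"):
    """Make two (channels, frames) waveforms the same length (hypothetical helper)."""
    n_a, n_b = a.shape[-1], b.shape[-1]
    if mode == "trim":
        # Cut the longer waveform down to the length of the shorter one.
        n = min(n_a, n_b)
        return a[..., :n], b[..., :n]
    if mode == "pad":
        # Append zeros (silence) to the shorter waveform along the frame axis.
        n = max(n_a, n_b)
        return F.pad(a, (0, n - n_a)), F.pad(b, (0, n - n_b))
    raise ValueError(f"unknown mode: {mode!r}")


# Stand-ins for the mic and soundcard signals, with different frame counts.
mic = torch.ones(1, 5)
card = torch.ones(1, 8)

mic_p, card_p = align_lengths(mic, card, mode="pad")
mixed = (mic_p + card_p) / 2  # average of the aligned waveforms

mic_t, card_t = align_lengths(mic, card, mode="trim")
```

With padding, the tail of the mix contains only the longer signal at half amplitude (because of the averaging); with trimming, the tail of the longer signal is discarded.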