I've built a U-Net model to perform audio mixing of multitrack audio, and I've trained it on 20s clips of the audio tracks (converted into spectrograms) as input. However, training takes an incredibly long time, so I think it would be better to train the model on 2s clips from each track instead.
The data is organised as 8 stems (individual instrument tracks) as the inputs and a single mixture of the stems as the target (all have sr=44100). I want to find the most energetic 2s section of the mixture track and crop all tracks (inputs and mixture) to this specific 2s part. I'm mainly using librosa in my data preparation, but I'm unsure what functions to use to find the start point of the loudest (I understand this is ambiguous) 88200-sample segment (2s).
If I am following the question well enough, the code below might be useful as a starting point. It takes in one sound file, locates where it is "loudest" (as you allude to in the question, defining which bit is loudest is not entirely straightforward) using librosa.feature.rms, and then cuts a two-second slice out of the original sound file centered on that point:
import librosa
FILENAME = 'soundfile.wav' # change to path of your sound file
FRAME_LENGTH = 2048
HOP_LENGTH = 512
NUM_SECONDS_OF_SLICE = 2
sound, sr = librosa.load(FILENAME, sr=None)
clip_rms = librosa.feature.rms(y=sound,
                               frame_length=FRAME_LENGTH,
                               hop_length=HOP_LENGTH)
clip_rms = clip_rms.squeeze()
peak_rms_index = clip_rms.argmax()
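# With the default center=True, RMS frame i is centered on sample i * HOP_LENGTH
peak_index = peak_rms_index * HOP_LENGTH
slice_length = int(NUM_SECONDS_OF_SLICE * sr)
# Clamp the start so the slice stays inside the file and keeps its full length
left_index = max(0, min(peak_index - slice_length // 2, len(sound) - slice_length))
right_index = left_index + slice_length
sound_slice = sound[left_index:right_index]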
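Centering on the single loudest RMS frame is only one reading of "loudest". Another, closer to what the question asks for, is the 2s window whose total energy (sum of squared samples) is highest. Below is a minimal sketch of that alternative, assuming each track is a mono 1-D NumPy array (which is what librosa.load returns by default) and that all tracks are sample-aligned. The helper names loudest_window_start and crop_tracks are made up for this example and are not librosa functions.

```python
import numpy as np


def loudest_window_start(signal, window_length):
    """Return the start index of the window with the highest total energy."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) <= window_length:
        # The signal is no longer than one window: start at the beginning
        return 0
    # Prefix sums of squared samples: energy[i] == sum(signal[:i] ** 2)
    energy = np.concatenate(([0.0], np.cumsum(signal ** 2)))
    # Energy of every window [start, start + window_length)
    window_energy = energy[window_length:] - energy[:-window_length]
    return int(np.argmax(window_energy))


def crop_tracks(tracks, start, window_length):
    """Cut the same sample range out of every track (assumes aligned tracks)."""
    return [np.asarray(track)[start:start + window_length] for track in tracks]
```

With the mixture and the stems loaded at sr=44100, start = loudest_window_start(mixture, 2 * sr) gives the start sample, and crop_tracks(stems + [mixture], start, 2 * sr) applies the same 88200-sample crop to every track so the inputs and the target stay aligned.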