audio-processing, speaker-diarization

Speaker identification: embeddings and audio fragment length


I have a collection of audio samples, each matched with a specific speaker, like

nick_sample1.mp3 nick_sample2.mp3 ... nick_sampleN.mp3

john_sample1.mp3 john_sample2.mp3 ... john_sampleK.mp3

The task is to match a given sampleX.mp3 with one of the known speakers (or none of them). sampleX.mp3 is itself the output of a diarization process, so in my case it most likely contains a single speaker. My current idea is to break the known samples into fragments of equal length and calculate embeddings (pyannote). Then train a classifier for each speaker (not sure which one to use at the moment). The classifier would output the likelihood that a given embedding belongs to, say, Nick.
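To make that idea concrete, here is a minimal sketch of the training side, assuming pyannote.audio's Inference wrapper for sliding-window embeddings and scikit-learn's LogisticRegression standing in for the per-speaker classifiers (the model name, window length, step, and file lists are placeholders):

```python
import numpy as np
from pyannote.audio import Model, Inference
from sklearn.linear_model import LogisticRegression

# Sliding-window inference yields one embedding per fixed-length fragment.
# The pretrained model may require a HuggingFace auth token; the 3 s window
# and 1.5 s step are illustrative values.
model = Model.from_pretrained("pyannote/embedding")
inference = Inference(model, window="sliding", duration=3.0, step=1.5)

def fragment_embeddings(path):
    """Return an array of shape (num_fragments, embedding_dim)."""
    return inference(path).data

# Collect labeled fragment embeddings from the known samples.
files = {
    "nick": ["nick_sample1.mp3", "nick_sample2.mp3"],
    "john": ["john_sample1.mp3", "john_sample2.mp3"],
}
X, y = [], []
for speaker, paths in files.items():
    for path in paths:
        emb = fragment_embeddings(path)
        X.append(emb)
        y.extend([speaker] * len(emb))
X = np.vstack(X)

# One multiclass model stands in for a classifier per speaker:
# predict_proba returns a per-speaker likelihood for each fragment.
clf = LogisticRegression(max_iter=1000).fit(X, y)
```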

So the identification process is the following:

  1. break sampleX.mp3 into fragments
  2. run each fragment's embedding through each classifier
  3. calculate a likelihood score for each speaker; the highest score wins and is considered the speaker in sampleX (see the sketch after this list)
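A matching sketch of this identification loop, reusing fragment_embeddings and clf from the training sketch above (the rejection threshold min_prob is a placeholder to tune on held-out data):

```python
import numpy as np

def identify(path, min_prob=0.6):
    """Average per-fragment probabilities and pick the most likely speaker."""
    emb = fragment_embeddings(path)  # per-fragment embeddings, as above
    mean_probs = clf.predict_proba(emb).mean(axis=0)
    best = int(np.argmax(mean_probs))
    # Reject as 'unknown' when even the best speaker is not confident enough.
    if mean_probs[best] < min_prob:
        return "unknown"
    return clf.classes_[best]

print(identify("sampleX.mp3"))
```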

Questions:

  1. How should I break sampleX.mp3 into fragments? Is there a guideline for this?
  2. What is the best option for a classifier?

Solution

  • Your overall approach is good. However, for speaker identification using embeddings, the standard approach is to use a distance function on the vectors rather than a classifier model. A commonly used distance function for embeddings is cosine distance. The procedure: compute the distance from the unknown sample to all known speaker samples; if no match falls below a certain distance threshold, return 'unknown'; otherwise, return the closest match (a sketch of this follows below).

    For splitting into fragments, you cut the audio signal itself, which is usually represented as a NumPy array. Compute the start and end sample indices by taking a time in seconds and multiplying it by the sample rate (then converting to an integer). You may also want to use overlapping fragments, as sketched in the second block below.
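A minimal sketch of the distance-based matching from the first point, assuming pyannote.audio's Inference with window="whole" to get one embedding per file (the model name, file names, and the 0.5 threshold are placeholders to adapt):

```python
import numpy as np
from pyannote.audio import Model, Inference
from scipy.spatial.distance import cosine

# One embedding per whole file; the model may require a HuggingFace token.
model = Model.from_pretrained("pyannote/embedding")
inference = Inference(model, window="whole")

def embed(path):
    """Return a single embedding vector for an audio file."""
    return np.asarray(inference(path))

# Enrollment: average each speaker's sample embeddings into one reference.
enrollment = {
    "nick": np.mean([embed(f"nick_sample{i}.mp3") for i in (1, 2)], axis=0),
    "john": np.mean([embed(f"john_sample{i}.mp3") for i in (1, 2)], axis=0),
}

def identify(path, threshold=0.5):
    """Return the closest enrolled speaker, or 'unknown' if all are too far."""
    query = embed(path)
    distances = {name: cosine(query, ref) for name, ref in enrollment.items()}
    best = min(distances, key=distances.get)
    # The threshold is a placeholder; tune it on held-out data.
    return best if distances[best] < threshold else "unknown"

print(identify("sampleX.mp3"))
```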
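And a short sketch of the slicing just described, assuming librosa for decoding (the 3 s fragment length and 1.5 s hop, i.e. 50% overlap, are illustrative values):

```python
import librosa

def split_fragments(path, fragment_s=3.0, hop_s=1.5):
    """Yield fixed-length, overlapping fragments of a mono waveform."""
    # librosa decodes the file to a mono float array at its native rate.
    signal, sr = librosa.load(path, sr=None, mono=True)
    frag_len = int(fragment_s * sr)  # seconds -> sample count
    hop_len = int(hop_s * sr)
    last_start = max(len(signal) - frag_len, 0)
    for start in range(0, last_start + 1, hop_len):
        yield signal[start:start + frag_len]

fragments = list(split_fragments("sampleX.mp3"))
```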