I have a collection of audio samples, each associated with a known speaker, like
nick_sample1.mp3 nick_sample2.mp3 ... nick_sampleN.mp3
john_sample1.mp3 john_sample2.mp3 ... john_sampleK.mp3
The task is to match a given sampleX.mp3 with one of the known speakers (or none of them). SampleX.mp3 is itself the result of a diarization process, so in my case it most likely contains a single speaker. My current idea is to break the known samples into fragments of equal length and calculate embeddings (pyannote). Then train a classifier for each speaker (I'm not sure which classifier to use at the moment). The classifier would output the likelihood that a given embedding belongs to, say, Nick.
So the identification process is the following:
Questions:
Your overall approach is good. However, for speaker identification using embeddings, the standard approach is to use a distance function on the vectors rather than a classifier model. A commonly used distance function for embeddings is cosine distance. The procedure: compute the distance from the unknown embedding to all known speaker embeddings; if no match falls below a chosen distance threshold, return 'unknown'; otherwise, return the closest match.
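A minimal sketch of that matching step, assuming you already have one reference embedding per known speaker (e.g. the average of the embeddings of that speaker's fragments) stored as numpy arrays; the speaker_embs dict and the threshold value are illustrative and would need tuning on your own data:

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means same direction, 2 means opposite
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def identify(unknown_emb, speaker_embs, threshold=0.5):
    """Return the closest enrolled speaker, or None if nobody is close enough.

    speaker_embs: dict mapping speaker name -> reference embedding (np.ndarray)
    threshold: example value; tune it on held-out samples of your speakers
    """
    distances = {name: cosine_distance(unknown_emb, emb)
                 for name, emb in speaker_embs.items()}
    best_name, best_dist = min(distances.items(), key=lambda kv: kv[1])
    return best_name if best_dist < threshold else None
```

For example, averaging the embeddings of nick_sample1.mp3 ... nick_sampleN.mp3 gives a single reference vector for Nick, and the same for John; identify() then compares the embedding of sampleX.mp3 against those references.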
For splitting into fragments, you cut the audio signal itself, which is usually represented as a numpy array. You compute the start and end sample indices by taking a time in seconds and multiplying by the sample rate (and converting to an integer). You may also want to use overlapping fragments.
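A sketch of that cutting step, assuming the audio has already been loaded into a 1-D numpy array together with its sample rate (e.g. via librosa or soundfile); the fragment and hop lengths are just example values, and a hop shorter than the fragment length produces overlapping windows:

```python
import numpy as np

def split_into_fragments(waveform, sample_rate, fragment_sec=3.0, hop_sec=1.5):
    """Cut a 1-D waveform into fixed-length, possibly overlapping fragments."""
    fragment_len = int(fragment_sec * sample_rate)  # seconds -> number of samples
    hop_len = int(hop_sec * sample_rate)
    fragments = []
    # Step through the signal, keeping only full-length fragments
    for start in range(0, len(waveform) - fragment_len + 1, hop_len):
        end = start + fragment_len
        fragments.append(waveform[start:end])
    return fragments
```

Each returned fragment can then be passed to the embedding model, and the resulting embeddings averaged (or compared individually) per speaker.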