From what I understand, when segmented speech audio is fed to a DNN, the average of the features extracted from the last hidden layer is the 'd-vector'. In that case, can a d-vector be extracted for a speaker who was not part of the training data, simply by feeding that speaker's voice to the network? And if so, given segmented features (mel-filterbank or MFCC) from an audio file spoken by multiple people, can we distinguish the speakers by clustering the extracted d-vectors as described above?
To answer your questions:
After you train the model, you can get the d-vector simply by forward-propagating the input through the network. Normally you look at the output (final) layer of the ANN, but you can equally read off the activations of the penultimate layer, i.e. the last hidden layer whose frame-averaged output is the d-vector.
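For illustration, here is a minimal sketch of that extraction step (PyTorch assumed; the layer sizes, feature dimension, and class/method names are mine, not from any specific paper). Note that the classifier head is only needed during training; at extraction time the network acts purely as a feature extractor, which is why this also works for speakers outside the training set:

```python
# Minimal sketch: extract a d-vector as the time-averaged activations
# of the last hidden layer. All names and sizes here are illustrative.
import torch
import torch.nn as nn

class SpeakerDNN(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_speakers=100):
        super().__init__()
        # Stack of fully connected hidden layers; the network is trained
        # to classify the training speakers.
        self.hidden_layers = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),   # last hidden layer
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # used only for training

    def forward(self, frames):
        return self.classifier(self.hidden_layers(frames))

    def d_vector(self, frames):
        # Forward-propagate the frames, keep the last hidden layer's
        # activations, and average over time: one vector per utterance.
        with torch.no_grad():
            h = self.hidden_layers(frames)   # (n_frames, hidden)
        return h.mean(dim=0)                 # (hidden,)

# Usage: `frames` would be the stacked filterbank/MFCC frames of one utterance.
model = SpeakerDNN()
frames = torch.randn(300, 40)    # dummy features: 300 frames, 40 dims each
dvec = model.d_vector(frames)    # the utterance-level d-vector
```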
Yes, you can distinguish speakers with the d-vector: it is, in effect, a high-level embedding of the audio signal that captures speaker-specific characteristics, so d-vectors from different speakers are separable and can be clustered. See e.g. this paper.
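As for the clustering step you describe (essentially speaker diarization), here is a hedged sketch with scikit-learn, assuming the number of speakers is known in advance (a simplification; real systems often have to estimate it):

```python
# Minimal sketch: group audio segments by speaker via clustering of
# their d-vectors. Function and variable names are illustrative.
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

def cluster_speakers(d_vectors, n_speakers):
    """d_vectors: (n_segments, dim) array, one d-vector per audio segment."""
    # Length-normalize so Euclidean distance behaves like cosine distance,
    # the usual similarity measure for d-vectors.
    X = normalize(d_vectors)
    labels = AgglomerativeClustering(n_clusters=n_speakers).fit_predict(X)
    return labels  # labels[i] is the speaker index assigned to segment i

# Usage: suppose you extracted one d-vector per segment of a 2-speaker recording.
segments = np.random.randn(20, 256)   # dummy d-vectors for 20 segments
print(cluster_speakers(segments, n_speakers=2))
```

The normalization plus Euclidean clustering is one common choice; comparing d-vectors directly by cosine similarity (e.g. for speaker verification rather than diarization) works on the same principle.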