I've been using a pre-trained VITS model (trained on the VCTK dataset) for text-to-speech synthesis. I successfully obtained the list of available speakers using the command:
!tts --model_name tts_models/en/vctk/vits --list_speaker_idxs
Additionally, I've synthesized audio from one of the speakers (p234) using the following code:
!tts --text "Working on a big project. Good wishes!" \
--out_path /content/speech.wav \
--model_name tts_models/en/vctk/vits \
--speaker_idx p234
Now I need to synthesize audio with the same pre-trained model, but in the voice of a speaker who was not part of the dataset during training. I understand that I need to provide a reference audio for this purpose (zero-shot).
Can someone guide me on how to achieve this? Any suggestions or code examples would be highly appreciated. Thank you!
For the tts_models/en/vctk/vits model you cannot specify a reference voice/speaker: this model does not support voice cloning.
What you can do instead is run TTS with one of the default voices and then apply voice conversion so that the generated audio sounds like your reference voice.
knn-vc is a good voice conversion model for this second step.