text-to-speech, voice-recognition, speech-synthesis

Synthesizing Audio with Unseen Speakers Using Pre-trained VITS Model


I've been using a pre-trained VITS model (trained on the VCTK dataset) for text-to-speech synthesis. I successfully obtained the list of available speakers with the command:

!tts --model_name tts_models/en/vctk/vits --list_speaker_idxs

Additionally, I've synthesized audio from one of the speakers (p234) using the following code:

!tts --text "Working on a big project. Good wishes!" \
--out_path /content/speech.wav \
--model_name tts_models/en/vctk/vits \
--speaker_idx p234

Now I need to synthesize audio from the same pre-trained model, but with the voice of a speaker who was not in the training dataset. I understand that this requires providing a reference audio clip (zero-shot voice cloning).

Can someone guide me on how to achieve this? Any suggestions or code examples would be highly appreciated. Thank you!


Solution

  • For the model tts_models/en/vctk/vits you cannot specify a reference voice or speaker; this model does not support voice cloning.

    You can run TTS with one of its built-in voices and then apply voice conversion so that the generated audio sounds like your reference speaker.

    knn-vc is a good voice conversion model for that second step.
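As a sketch of that two-step pipeline using the same `tts` CLI as in the question: Coqui TTS also ships a FreeVC-based voice conversion model (`voice_conversion_models/multilingual/vctk/freevc24`) that accepts a source and a target wav. The reference recording path below (`/content/reference.wav`) is a hypothetical placeholder for your own clip; knn-vc, as recommended above, is an alternative for the conversion step but is used through its Python API rather than this CLI.

```shell
# Step 1: synthesize with one of the model's built-in speakers
tts --text "Working on a big project. Good wishes!" \
    --out_path /content/speech.wav \
    --model_name tts_models/en/vctk/vits \
    --speaker_idx p234

# Step 2: convert the synthesized voice toward the reference speaker.
# --source_wav is the audio whose content is kept;
# --target_wav is the reference voice to imitate.
tts --model_name voice_conversion_models/multilingual/vctk/freevc24 \
    --source_wav /content/speech.wav \
    --target_wav /content/reference.wav \
    --out_path /content/cloned_speech.wav
```

The result in /content/cloned_speech.wav should carry the words from step 1 in a voice resembling the reference clip; conversion quality depends heavily on the length and cleanliness of the reference audio.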