
Festival unit selection voice Missing diphone: # hash


Some background: while trying to build a unit selection voice, I followed the steps here: https://github.com/CSTR-Edinburgh/CSTR-Edinburgh.github.io/blob/master/_posts/2016-8-21-Multisyn_unit_selection.md and used a voice definition from here: https://raw.githubusercontent.com/CSTR-Edinburgh/merlin/master/egs/hybrid_synthesis/s1/voice_definition_files/unit_selection/cstr_us_awb_arctic_multisyn.scm. Unfortunately, the wavs were too noisy, so I skipped the automatic labelling process and hand-labelled them instead.

The voice is OK now but still needs some work. One error that occurs constantly is that Festival reports "Missing diphone" for any transition between a pause and a phone, e.g.:

festival> (utt.relation.print (SayText "I can say anything I want.") 'Unit)
Missing diphone: #_ay
 diphone still missing, backing off: #_ay
 backed off: #_ay -> #_ax
 diphone still missing, backing off: #_ax
 backed off: #_ay -> #_#
 diphone still missing, backing off: #_#
 backed off: #_ay ->
Missing diphone: ey_eh
 Interword so inserting silence.
 diphone still missing, backing off: ey_#
 backed off: ey_eh -> ax_#
 diphone still missing, backing off: ax_#
 backed off: ey_eh -> #_#
 diphone still missing, backing off: #_#
 backed off: ey_eh ->
Missing diphone: #_eh
 diphone still missing, backing off: #_eh
 backed off: #_eh -> #_ax
 diphone still missing, backing off: #_ax
 backed off: #_eh -> #_#
 diphone still missing, backing off: #_#
 backed off: #_eh ->
Missing diphone: t_#
 diphone still missing, backing off: t_#
 backed off: t_# -> #_#
 diphone still missing, backing off: #_#
 backed off: t_# ->
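
To see at a glance which diphones are being reported, the output above can be tallied with a small script. This is only a rough sketch in Python, not part of Festival: `tally_missing_diphones` is a hypothetical helper, and it assumes the log lines look exactly like the "Missing diphone: x_y" lines shown above and that phone names contain no underscore.

```python
import re
from collections import Counter

# Matches only the top-level "Missing diphone: left_right" lines, not the
# indented back-off lines. Assumes phone names contain no underscore.
MISSING_RE = re.compile(r"^\s*Missing diphone:\s*(\S+?)_(\S+)\s*$")


def tally_missing_diphones(log_text, pause="#"):
    """Count missing diphones, split by whether they involve the pause phone."""
    with_pause = Counter()
    without_pause = Counter()
    for line in log_text.splitlines():
        match = MISSING_RE.match(line)
        if not match:
            continue
        left, right = match.groups()
        target = with_pause if pause in (left, right) else without_pause
        target[f"{left}_{right}"] += 1
    return with_pause, without_pause


if __name__ == "__main__":
    # A few lines copied from the Festival output above.
    sample_log = """Missing diphone: #_ay
 diphone still missing, backing off: #_ay
Missing diphone: ey_eh
 Interword so inserting silence.
Missing diphone: #_eh
Missing diphone: t_#
"""
    with_pause, without_pause = tally_missing_diphones(sample_log)
    print("involving a pause:", sorted(with_pause))
    print("other:", sorted(without_pause))
```

If nearly every entry lands in the first group, that suggests the pause units never made it into the voice, rather than a handful of rare diphones being absent.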

I tried replacing sil and sp (from the automatic process) in the labels with pau and h# (to match the silences used in festival/lib/radio_phones.scm), and I also tried replacing them with just #, but neither changed anything. The source wavs/labs definitely contain the transitions above (e.g. several start with "I can"), but Festival never seems to use them.

How can I get Festival to use the pause-to-phone transitions in the source data?

Thanks!


Solution

  • The problem was that, when I ran a script based on the Multisyn unit selection recipe, the build_utts step was failing and skipping utterances because the hand-labelled labels didn't exactly match what Festival would have predicted. For example, if the speaker had said "extreme" as eh k s ... but Festival predicted ih k s ..., the build_utts script would fail with an error like:

    align missmatch at ih (0.000000) eh (2.810566)
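
A mismatch like this can be spotted before running build_utts by comparing the two phone sequences directly. The following is a minimal sketch using Python's standard difflib, not what build_utts does internally: `phone_mismatches` is a hypothetical helper, and it assumes you have already obtained Festival's predicted phones and your hand labels as plain lists.

```python
from difflib import SequenceMatcher


def phone_mismatches(predicted, labelled):
    """Return (operation, predicted_slice, labelled_slice) for each difference."""
    matcher = SequenceMatcher(a=predicted, b=labelled, autojunk=False)
    return [
        (tag, predicted[i1:i2], labelled[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Only the first three phones of the "extreme" example; the rest is elided.
    predicted = ["ih", "k", "s"]
    labelled = ["eh", "k", "s"]
    for tag, want, got in phone_mismatches(predicted, labelled):
        print(f"{tag}: Festival expects {want}, label file has {got}")
```

An empty result means the two sequences agree. A "replace" points at a differing phone, and an "insert" or "delete" points at an extra or missing label.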
    

    I manually ran the build_utts script for each utterance and adjusted the labels accordingly. If, like me, you are foolish enough to try hand-labelling yourself, a couple of tips that helped me:

    • Consider removing any phone closures such as t_cl or d_cl, as these can really throw off the matching.
    • Make sure there is a pause (i.e. #) at the start and end of each utterance. The build_utts script won't complain if it is missing, but when running the voice in Festival you will get an error like:

              -=-=-=-=-=- EST Error -=-=-=-=-=-
              {FND} Feature end not defined
      
              -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
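
Both tips above can be checked mechanically before building the voice. This is a rough sketch under stated assumptions: `check_phone_labels` is a hypothetical helper, the pause symbol is taken to be # and closures are taken to end in _cl as in the examples above, and reading the phones out of a label file is left out because the file format depends on the labelling tool.

```python
def check_phone_labels(phones, pause="#", closure_suffix="_cl"):
    """Return a list of human-readable problems for one utterance's phone labels."""
    problems = []
    # A missing pause at either end is not reported by build_utts, so check here.
    if not phones or phones[0] != pause:
        problems.append(f"utterance does not start with pause '{pause}'")
    if not phones or phones[-1] != pause:
        problems.append(f"utterance does not end with pause '{pause}'")
    # Closure labels such as t_cl or d_cl can upset the label matching.
    closures = [p for p in phones if p.endswith(closure_suffix)]
    if closures:
        problems.append("closure labels present: " + ", ".join(closures))
    return problems


if __name__ == "__main__":
    # Hypothetical labels for illustration only.
    phones = ["ay", "k", "ae", "n", "t_cl", "t", "#"]
    for problem in check_phone_labels(phones):
        print(problem)
```

Running a check like this over every utterance gives a list of files to fix by hand before re-running build_utts.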
      

    Thanks to @NikolayShmyrev for pointing me in the right direction. He also recommended Ossian as an alternative to Festival, since Ossian uses Python rather than Festival's fairly difficult code.