Synthesize phoneme pairs on OSX


I need to create WAV files of 144 phoneme pairs, such as "Da Di Du, Beh Bi Burr, ..."

Specifically I need each one to maintain a constant pitch, so that I can pitch-shift them to make musical notes (If I could input pitch values that would be even better!).

I don't really want to record 144 .WAV files of me trying to sing them.

Can I do this using OS X's built-in speech synthesis API?

If not, is there any other way I can do it?

EDIT: I don't require any particular quality grade. The important thing is that each utterance is distinguishable and at the correct pitch.

EDIT: I will put my attempts at solving this below. If I reach something I'm happy with, I will break it out into an answer.

Apple's Speech Synthesis Programming Guide seems to have everything: it covers controlling the pitch using contours, and entering phonetic input.

However, it would be a lot of work to figure out the whole API and write an OS X project to do it. So I'm interested in command-line options or in using existing synthesisers.

CRGreen's answer uses parameters to 'say' that I can't find documented in the man page.

Just found an example here: http://hints.macworld.com/article.php?story=20120204172337402

EDIT: Phonemes https://apple.stackexchange.com/questions/53858/in-terminal-how-to-get-say-to-say-things-right-ie-using-custom-phonetics


Solution

  • In AppleScript script editor:

    set diphones to {"Dah", "Di", "Du", "Beh", "Bi", "Burr"} --etc.
    
    set targetFolder to ((choose folder) as text)
    
    repeat with p in diphones
        say p using "Vicki" pitch 55 modulation 0 saving to (targetFolder & p & ".aif")
    end repeat
    

    Then convert the files to WAV.
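One way to do that conversion, as a minimal sketch: macOS ships a command-line tool, `afconvert`, that can rewrite an AIFF file as 16-bit little-endian WAV. The Python helper below only builds the argument list for one file. The helper name and the `diphones/` folder are made up for illustration, and actually running the command requires a Mac.

```python
from pathlib import PurePosixPath


def afconvert_command(aiff_path):
    """Build the afconvert argument list that turns one AIFF file into a WAV file."""
    src = PurePosixPath(aiff_path)
    dst = src.with_suffix(".wav")
    # -f WAVE selects the WAV container; -d LEI16 selects 16-bit little-endian PCM.
    return ["afconvert", "-f", "WAVE", "-d", "LEI16", str(src), str(dst)]


if __name__ == "__main__":
    # Print the command for a hypothetical file instead of running it.
    print(" ".join(afconvert_command("diphones/Dah.aif")))
```

On a Mac, each returned list could be passed to `subprocess.run(..., check=True)` in a loop over the folder that the AppleScript wrote to.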

    There are a few other options available in the "say" command dictionary.

    I don't think it is as simple as that, however. How the speech synth treats these diphones can be weird, and can even differ according to which voice you use. You may have to manipulate quite a few of them to sound the way you want. For example, Vicki says "Di" like "DEE" and "Bi" like "BYE". It is really hard to get those voices to intone a short "i" (as in "big") as just the diphone. It may even be necessary to have it say "big" (for example), then edit the sound in Audacity, cutting off the end, putting a fade-out on the edited version, and exporting that. I just did this and it works, but yeah, you'll need to do some special-case adjustments. If you have the Developer tools, there is also an app called "Repeat After Me" which allows you to "tune" spoken text, but (surprisingly) for the situation I just described, it doesn't help. (It is pretty powerful for larger chunks, though.)

    [edit] so, yes, the phonetic input version of the above could be like this:

    set diphones to {"dAO", "dIH", "dAX", "bEH", "bIH", "brr"} --etc., changed to be phonetic based on Apple's system
    
    set targetFolder to ((choose folder) as text)
    
    repeat with p in diphones
        say ("[[inpt PHON]]" & p & "[[inpt TEXT]]") using "Vicki" pitch 52 modulation 0 saving to (targetFolder & p & ".aif")
    end repeat
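
Since the question asks about command-line options, here is a rough sketch of the same loop body expressed as a `say` shell command, built from Python. It assumes that the `say` command-line tool honours the same embedded commands as AppleScript's `say`, and that `[[pbas N]]` and `[[pmod 0]]` behave like the `pitch` and `modulation` parameters above. Neither assumption has been checked here, and the helper name and output folder are hypothetical.

```python
def phoneme_say_command(phonemes, voice="Vicki", pitch=52, out_dir="."):
    """Build a `say` argument list that speaks one phoneme string into an AIFF file."""
    # Assumption: pbas sets the baseline pitch and pmod 0 flattens the modulation.
    text = f"[[pbas {pitch}]] [[pmod 0]] [[inpt PHON]]{phonemes}[[inpt TEXT]]"
    out_path = f"{out_dir}/{phonemes}.aiff"
    # -v picks the voice, -o writes the audio to a file instead of playing it.
    return ["say", "-v", voice, "-o", out_path, text]


if __name__ == "__main__":
    for p in ["dAO", "dIH", "dAX", "bEH", "bIH"]:
        print(phoneme_say_command(p, out_dir="diphones"))
```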
    

    [ADDENDUM]

    Years ago Apple's voices would all act the same, and you could tune any voice to perfectly sing a song (I did the "Star Spangled Banner" one night). Then, for some reason, the developers not only changed the voices, but took away the consistency so that some voices behave completely differently compared to others. I wasn't happy about this. Consider the following:

    Using the default voice ("Alex"), the following utterance comes out (you'll be encouraged to find) as even as can be:

    say "[[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Alex"
    

    But if you use "Cellos" or "Pipe Organ", you get that bizarre lift at the end, even if you use this TUNE mode. Don't ask me why. So how did I get this to work, at least for "Alex"? I used the aforementioned "Repeat After Me" app and simplified the "tuned" output. I think you can probably get what you want using some variation of TUNE and PHON. But you'll probably have to stay away from "Cellos" and "Pipe Organ" because they are problematic for making monotonous intonations (although they may be fine for certain diphones/triphones). And maybe you'll have to use both, which is, I know, annoying. I feel your pain.

    One more variation. Notice the way the following "rate" tag forces a longer utterance:

    say "[[rate - 66]] [[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Alex"
    

    [ADDENDUM II]

    Ah, but check this out. This evens out the "Pipe Organ" and gets rid of the end lift by forcing a baseline pitch change ("pbas") before the last phoneme:

    say "[[rate - 66]] [[inpt TUNE]] d {D 114; P 95.0:100} UW {D 227; P 95.0:100} [[pbas - 5]] 1IY {D 382; P 95.0:100} . {D 30} [[inpt TEXT]]" using "Pipe Organ"
    

    They're making us work way too hard here :-)

    Here's a simplified version, going back to your original but sticking that pbas in there:

    say "[[inpt TUNE]] d UW [[pbas - 5]] 1IY [[inpt TEXT]]" using "Pipe Organ"
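
To address the wish in the question to feed in pitch values directly, the TUNE strings above can be generated rather than typed by hand. The sketch below assumes that the `P` value in a TUNE block is a frequency in hertz (as the 95.0 in the examples suggests) and uses the standard equal-temperament formula with A4 = 440 Hz = MIDI note 69. The function names are hypothetical, and the durations are simply the ones used above.

```python
def midi_to_hz(note):
    """Convert a MIDI note number to a frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)


def tune_utterance(phonemes, note):
    """Build a TUNE-mode string holding every phoneme at one pitch.

    phonemes is a list of (symbol, duration_in_ms) pairs.
    """
    hz = round(midi_to_hz(note), 1)
    # Same shape as the hand-written examples: one pitch point at 100% of each phoneme.
    parts = [f"{symbol} {{D {duration}; P {hz}:100}}" for symbol, duration in phonemes]
    return "[[inpt TUNE]] " + " ".join(parts) + " . {D 30} [[inpt TEXT]]"


if __name__ == "__main__":
    print(tune_utterance([("d", 114), ("UW", 227), ("1IY", 382)], 57))
```

The resulting string can be pasted into the `say` lines above. Voices that add the end lift would presumably still need the `pbas` workaround.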