Tags: php, regex, text-to-speech, festival, speech-synthesis

Exploding acronyms to ensure a synthesizer reads them properly?


If I feed a speech synthesizer (Festival, in this case, but it applies to all) the following bit of text:

"At the USPGA championship in the US, the BBC reporter went MIA", it reads "At the uspga championship in the us, the BBC reporter went mia".

In other words, I guess that because it's a cluster of consonants, it reads "BBC" properly but makes "words" out of the others.

The simplest thing to do, I suppose, would be to run it through a PHP script which looks for words of 2 or more capital letters and simply "explodes" each one into space-separated letters, like U S P G A.

I realise it would cause weirdness with things like "I told him N O T to do that", but in news reports that tends to happen less.

Here's the thing: I can "explode" a word OK. The problem is that I'm one of those people who, despite months of trying, just can't get their head round certain aspects of regex. In this case, the pattern I need is: two or more capital letters next to each other.

The reason I gave all the preamble above is in case there's a better way of doing this that I hadn't found or thought of, perhaps a db of acronyms to words or something.


Solution

  • Using Delan's regular expression with preg_replace_callback() makes it very easy to put a single space between the letters of each identified acronym:

    $input = "At the USPGA championship in the US, the BBC reporter went MIA";
    
    // Callback: split the matched acronym into letters and rejoin them with spaces
    function cb_separateCapitals($matches) {
        return implode(' ',str_split($matches[0]));
    }
    
    
    echo $input,'<br />';
    
    // \b marks a word boundary; [A-Z]{2,} matches a run of two or more capital letters
    $output = preg_replace_callback('/\b([A-Z]{2,})\b/','cb_separateCapitals',$input);
    
    echo $output;
    

    giving

    At the USPGA championship in the US, the BBC reporter went MIA

    At the U S P G A championship in the U S, the B B C reporter went M I A
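
The question also raises two follow-ups: emphasised words such as "NOT" get exploded too, and a lookup of acronyms to words might be preferable for some entries. A minimal sketch of one way to handle both with the same regular expression is below. The function name `explodeAcronyms`, the `$overrides` array, and its entries are hypothetical and purely illustrative; which replacement text a given synthesizer reads well is an assumption you would need to check yourself.

```php
<?php
// Hypothetical helper: explode all-caps words unless an override is listed.
function explodeAcronyms(string $text, array $overrides = []): string
{
    return preg_replace_callback(
        '/\b[A-Z]{2,}\b/',
        function (array $matches) use ($overrides): string {
            $word = $matches[0];
            // A listed word is replaced by its override text instead of being exploded.
            if (isset($overrides[$word])) {
                return $overrides[$word];
            }
            // Everything else is split into single letters joined by spaces.
            return implode(' ', str_split($word));
        },
        $text
    );
}

// Illustrative entries only (assumed, not taken from any real acronym list).
$overrides = ['NOT' => 'not', 'NATO' => 'Nato'];

echo explodeAcronyms('I told him NOT to call NATO, the BBC or the USPGA', $overrides), "\n";
// prints: I told him not to call Nato, the B B C or the U S P G A
```

The override list has to be maintained by hand, so it only pays off for the handful of words that turn up regularly in your text.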