
Rhyme Dictionary from CMU pronunciation database


I'm looking for a free or open source rhyming database.

I've found the CMU pronunciation "database" and its series of apps but I can't make sense of them or figure out where the data's coming from.

A simple text file with the word and its phonemes is all I need.

Does anybody here know where I'd find one or where I would begin to derive such a list from the CMU files?


Solution

  • cmudict

    The cmudict is a text file and its format is really simple. First, the word is listed. Then, there are two spaces. Everything following the two spaces is the pronunciation. Where a word can be spoken in more than one way, you will see a separate entry for each pronunciation, like:

    word
    word(1)
    

    At the beginning of the file they've listed symbols and punctuation. Each symbol is followed by the English spelling of the symbol's name, with no space between them. This is then followed by the two-space divider and the ARPAbet code. Since you're only looking for rhymes, you don't have to do anything special with the symbols section: you're never going to be looking for a rhyme to ...ELLIPSIS
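The line format described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `parse_cmudict_line` is made up for illustration, and the handling of comment lines starting with `;;;` is an assumption about the file that the description above does not cover.

```python
import re

# Matches an alternate-pronunciation marker such as "(1)" at the end of a headword.
VARIANT_SUFFIX = re.compile(r"\(\d+\)$")


def parse_cmudict_line(line):
    """Return (word, variant, phonemes) for one line, or None if it has no entry."""
    line = line.rstrip("\n")
    # Assumption: blank lines and lines starting with ";;;" carry no entry.
    if not line.strip() or line.startswith(";;;"):
        return None
    # The headword ends at the first pair of consecutive spaces.
    word, sep, pronunciation = line.partition("  ")
    if not sep:
        return None
    variant = 0
    match = VARIANT_SUFFIX.search(word)
    if match:
        # "word(1)" becomes word="word", variant=1.
        variant = int(match.group()[1:-1])
        word = word[:match.start()]
    return word, variant, pronunciation.split()
```

Symbol entries such as `...ELLIPSIS` pass through unchanged as ordinary headwords, so they need no special case.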

    ARPAbet

    The mapping from ARPAbet codes to IPA is listed on Wikipedia at http://en.wikipedia.org/wiki/Arpabet, and each mapping shows example words. It's pretty easy to see how the two relate to one another, which may help you read the ARPAbet codes if you are familiar with IPA.
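To give a feel for how the two notations relate, here is a small sketch that converts an ARPAbet string to IPA. The table is deliberately partial (a handful of symbols only), the names `ARPABET_TO_IPA` and `to_ipa` are made up for illustration, and it assumes vowels carry a trailing stress digit (0, 1, or 2) that can simply be dropped.

```python
# Partial ARPAbet-to-IPA table: a few symbols only, for illustration.
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "IY": "i", "UW": "u",
    "SH": "ʃ", "TH": "θ", "DH": "ð", "NG": "ŋ",
    "K": "k", "T": "t", "HH": "h",
}


def to_ipa(arpabet):
    """Convert a space-separated ARPAbet string to IPA where the table allows."""
    pieces = []
    for symbol in arpabet.split():
        base = symbol.rstrip("012")  # drop the stress digit, if any
        # Symbols missing from the partial table are kept, wrapped in brackets.
        pieces.append(ARPABET_TO_IPA.get(base, "[" + base + "]"))
    return "".join(pieces)
```

For example, `to_ipa("K AE1 T")` gives `kæt`. A complete table would need every symbol from the Wikipedia page.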

    Summary

    Basically, if you've already found the cmudict then you've already got what you asked for: a database of words and their pronunciations. To find words that rhyme you'll have to parse the flat file into a table and run a query to find words that end with the same ARPAbet code.

    General Theory of Doing Stuff to Things

    Part: Stuff

    1. create a new database
    2. create a table in the database with three fields: index, word, arpabet
    3. read the cmudict file line by line
    4. split each line into two parts at the first pair of consecutive spaces
    5. increment the index count, then insert the index number, word, and ARPAbet code into the table
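The steps above could look roughly like this in Python with the standard-library `sqlite3` module. It is a sketch under a few assumptions: the table is called `words`, the index column is named `idx` (because `index` is a reserved word in SQL), and lines starting with `;;;` are treated as comments and skipped.

```python
import sqlite3


def load_cmudict(lines, db_path=":memory:"):
    """Load cmudict lines into a new SQLite database and return the connection."""
    conn = sqlite3.connect(db_path)
    # Steps 1 and 2: new database with a three-field table.
    conn.execute(
        "CREATE TABLE words ("
        "idx INTEGER PRIMARY KEY, word TEXT NOT NULL, arpabet TEXT NOT NULL)"
    )
    index = 0
    # Step 3: read line by line.
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and ";;;" comment lines hold no entry.
        if not line.strip() or line.startswith(";;;"):
            continue
        # Step 4: split at the first pair of consecutive spaces.
        word, sep, arpabet = line.partition("  ")
        if not sep:
            continue
        # Step 5: bump the index and insert the row.
        index += 1
        conn.execute(
            "INSERT INTO words (idx, word, arpabet) VALUES (?, ?, ?)",
            (index, word, arpabet.strip()),
        )
    conn.commit()
    return conn
```

Any iterable of lines works, so an open file object can be passed in directly. The default `:memory:` path keeps everything in RAM; pass a file name to keep the database on disk.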

    Then Umm...

    Once you've got the data into whatever kind of database you choose, you can use that database to find correlations between the ARPAbet codes. You could find rhymes, consonance, assonance, and other sound devices. It would go something like:

    Part: Thing

    1. get a word you want to find a rhyme for
    2. query the database for the arpabet equivalent of the word
    3. split the arpabet code into pieces by breaking it up everywhere there is a space
    4. take the last piece of the code and query the database for words whose ARPAbet codes end with that piece
    5. Do fancy things with the rhymes
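Steps 1 to 4 above might be sketched like this. It assumes a table `words(idx, word, arpabet)` like the one described in "Part: Stuff", and that headwords are stored in upper case; the function name `find_rhymes` is made up for illustration.

```python
import sqlite3


def find_rhymes(conn, word):
    """Return other words whose ARPAbet code ends with the same last piece."""
    key = word.upper()  # assumption: headwords are stored in upper case
    # Step 2: look up the ARPAbet code for the word.
    row = conn.execute(
        "SELECT arpabet FROM words WHERE word = ?", (key,)
    ).fetchone()
    if row is None:
        return []
    # Step 3: break the code up at the spaces.
    pieces = row[0].split()
    if not pieces:
        return []
    # Step 4: match codes that are exactly the last piece or end with " <piece>".
    last_piece = pieces[-1]
    rows = conn.execute(
        "SELECT DISTINCT word FROM words "
        "WHERE (arpabet = ? OR arpabet LIKE ?) AND word != ? "
        "ORDER BY word",
        (last_piece, "% " + last_piece, key),
    )
    return [r[0] for r in rows]
```

Matching on only the last piece is loose: every word ending in the same sound counts as a match. Comparing more than one trailing piece would narrow the results.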

    Shortcuts and Spoilers

    I got bored and wrote a Node.js module that covers "Part: Stuff" listed above. If you've got Node.js installed on your machine, you can get the module by running: npm install cmudict-to-sqlite. See https://npmjs.org/package/cmudict-to-sqlite for the README, or just look in the module for docs.