I'm going to be storing pronunciations of words in a MySQL table and I'd like to have a way to quickly search for pronunciations that have certain properties.
Here's an example of what some data might look like:
The simplest approach I could think of for storing the list data was to dump the list as a JSON-encoded string. The problem is that this approach doesn't seem to allow a useful index on that column.
Here are some example queries I'd like to be able to run very quickly (with the relevant data indexed in my database):
One idea I've had is to separate out the lexical stress information (the numbers to the right of each vowel phoneme in the example above) and store that stress information separately (e.g. [2, 0, 1, 0, 0]), but even then I'm not sure how I could index it.
Another idea that would seem to at least solve the lexical stress queries might be to store every possible lexical stress pattern in a separate table, and then have each pronunciation simply refer (via a foreign key) to one of those lexical stress patterns. I could then run my (Python) server code against that smaller set of possible lexical stress patterns to determine which patterns were a match, and then query the pronunciation table for those pronunciations which had one of those stress patterns. But I'm curious to know if there's some way to create a faster pure-SQL solution.
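To make the stress-pattern idea above concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the table names `StressPatterns` and `Pronunciations`, the encoding of a pattern as a digit string (so [2, 0, 1, 0, 0] becomes "20100"), the sample rows, and the helper `words_matching_stress`. It uses Python's built-in sqlite3 module in place of MySQL so that it runs without a server.

```python
import re
import sqlite3

# In-memory SQLite stands in for MySQL; all names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StressPatterns (
        StressPatternId INTEGER PRIMARY KEY,
        Pattern TEXT NOT NULL UNIQUE      -- e.g. '20100' for [2, 0, 1, 0, 0]
    );
    CREATE TABLE Pronunciations (
        PronunciationId INTEGER PRIMARY KEY,
        Word TEXT NOT NULL,
        StressPatternId INTEGER NOT NULL
            REFERENCES StressPatterns(StressPatternId)
    );
    CREATE INDEX idx_pron_stress ON Pronunciations (StressPatternId);
""")

# Illustrative sample rows, not data from the question.
conn.executemany(
    "INSERT INTO StressPatterns (StressPatternId, Pattern) VALUES (?, ?)",
    [(1, "01"), (2, "1"), (3, "20100")],
)
conn.executemany(
    "INSERT INTO Pronunciations (Word, StressPatternId) VALUES (?, ?)",
    [("hello", 1), ("world", 2), ("low", 2)],
)

def words_matching_stress(regex):
    """Match the small pattern table in Python, then look up words by id."""
    pattern_ids = [
        pid
        for pid, pattern in conn.execute(
            "SELECT StressPatternId, Pattern FROM StressPatterns"
        )
        if re.fullmatch(regex, pattern)
    ]
    if not pattern_ids:
        return []
    placeholders = ",".join("?" * len(pattern_ids))
    rows = conn.execute(
        "SELECT Word FROM Pronunciations "
        f"WHERE StressPatternId IN ({placeholders}) ORDER BY Word",
        pattern_ids,
    ).fetchall()
    return [row[0] for row in rows]

# Words whose final vowel carries primary stress.
final_stress = words_matching_stress(r"[012]*1")
```

The regular expression only ever scans the pattern table, which should stay small compared with the pronunciation table; the second query is a plain lookup on an indexed integer column.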
But to be clear: the main issue I'm interested in is indexing and querying the lists of phonemes.
The best structure is probably a table with one row per phoneme:
create table WordPhonemes (
    WordPhonemeId int auto_increment primary key,
    WordId int not null,
    Position int not null,
    Phoneme varchar(255),
    foreign key (WordId) references Words(WordId),
    index(WordId, Position),
    index(Phoneme),
    index(WordId, Phoneme)
);
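As a rough illustration of how this layout can be queried, the sketch below finds words that contain one phoneme immediately followed by another, using a self-join on `Position`. It is written under stated assumptions: SQLite (via Python's sqlite3 module) stands in for MySQL so the example runs on its own, the `Words` table is reduced to two columns, the index names and the helper `words_with_sequence` are made up, and the sample pronunciations are illustrative only.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; the WordPhonemes columns follow the
# schema above, while the index names and the Words columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Words (WordId INTEGER PRIMARY KEY, Word TEXT NOT NULL);
    CREATE TABLE WordPhonemes (
        WordPhonemeId INTEGER PRIMARY KEY AUTOINCREMENT,
        WordId INTEGER NOT NULL REFERENCES Words(WordId),
        Position INTEGER NOT NULL,
        Phoneme VARCHAR(255)
    );
    CREATE INDEX idx_word_pos ON WordPhonemes (WordId, Position);
    CREATE INDEX idx_phoneme ON WordPhonemes (Phoneme);
    CREATE INDEX idx_word_phoneme ON WordPhonemes (WordId, Phoneme);
""")

# Illustrative sample rows: one row per phoneme, positions starting at 1.
sample = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "low": ["L", "OW1"],
}
for word_id, (word, phonemes) in enumerate(sample.items(), start=1):
    conn.execute(
        "INSERT INTO Words (WordId, Word) VALUES (?, ?)", (word_id, word)
    )
    conn.executemany(
        "INSERT INTO WordPhonemes (WordId, Position, Phoneme) VALUES (?, ?, ?)",
        [(word_id, pos, ph) for pos, ph in enumerate(phonemes, start=1)],
    )

def words_with_sequence(first, second):
    """Return words in which `first` is immediately followed by `second`."""
    rows = conn.execute(
        """
        SELECT DISTINCT w.Word
        FROM WordPhonemes a
        JOIN WordPhonemes b
          ON b.WordId = a.WordId AND b.Position = a.Position + 1
        JOIN Words w ON w.WordId = a.WordId
        WHERE a.Phoneme = ? AND b.Phoneme = ?
        ORDER BY w.Word
        """,
        (first, second),
    ).fetchall()
    return [row[0] for row in rows]

seq_result = words_with_sequence("L", "OW1")
```

Longer sequences follow the same shape with one extra self-join per phoneme. The SQL in the helper uses only standard joins, so it should carry over to MySQL, but that has not been verified here.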
Not all of your queries will necessarily be simple. However, the indexes will help. In addition, if the Words table stores the total number of phonemes for each word, that will also help.
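One way a stored phoneme count can help is with queries anchored at the end of a word. The sketch below is only an illustration under assumptions: the column name `NumPhonemes`, the helper `words_ending_in`, the index name, and the sample rows are all hypothetical, and SQLite (through Python's sqlite3 module) is used in place of MySQL so the code is self-contained.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; NumPhonemes is a hypothetical column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Words (
        WordId INTEGER PRIMARY KEY,
        Word TEXT NOT NULL,
        NumPhonemes INTEGER NOT NULL      -- total phonemes in the word
    );
    CREATE TABLE WordPhonemes (
        WordId INTEGER NOT NULL REFERENCES Words(WordId),
        Position INTEGER NOT NULL,
        Phoneme VARCHAR(255)
    );
    CREATE INDEX idx_phoneme ON WordPhonemes (Phoneme);
""")

# Illustrative sample rows; the count is stored when the word is inserted.
sample = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
    "low": ["L", "OW1"],
}
for word_id, (word, phonemes) in enumerate(sample.items(), start=1):
    conn.execute(
        "INSERT INTO Words (WordId, Word, NumPhonemes) VALUES (?, ?, ?)",
        (word_id, word, len(phonemes)),
    )
    conn.executemany(
        "INSERT INTO WordPhonemes (WordId, Position, Phoneme) VALUES (?, ?, ?)",
        [(word_id, pos, ph) for pos, ph in enumerate(phonemes, start=1)],
    )

def words_ending_in(phoneme):
    """Return words whose last phoneme equals `phoneme`."""
    rows = conn.execute(
        """
        SELECT w.Word
        FROM WordPhonemes p
        JOIN Words w
          ON w.WordId = p.WordId AND p.Position = w.NumPhonemes
        WHERE p.Phoneme = ?
        ORDER BY w.Word
        """,
        (phoneme,),
    ).fetchall()
    return [row[0] for row in rows]

ending = words_ending_in("OW1")
```

With the count stored on `Words`, the "last phoneme" condition becomes an equality join instead of a per-word `MAX(Position)` subquery. The same column also makes "exactly N phonemes" filters a simple comparison.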