database database-design libreoffice-base

Relational database design for hierarchical data?

I am a trying to design a database to act as a language dictionary where each word is associated not only to its definition by also to its grammatical "taxon". E.g., it should look something like this:

"eat": verb.imperative
"eat": verb.present
"ate": verb.past
"he": pronoun.masculine.singular
"she": pronoun.feminine.singular
"heiress": noun.feminine.singular
"heirs": noun.masculine.plural
"therefore": adverb
"but": conjunction

It seems that a natural data structure to hold such a grammatical "taxonomy" should be some kind of tree or graph. Although I haven't thought it through, I presume that should make it easier to perform queries of the type

plural OF masculine OF "heiress" -> "heirs"

At this point, however, I am just trying to come up with the least ineffective way to store such a dictionary in a regular relational database (namely a LibreOffice Base). What do you suggest the data schema should be like? Is there something more efficient than the brute force method where I'd have as many boolean columns as there are grammatical types and sub-types? E.g., "she" would be true for the columns pronoun, feminine, and singular, but false for all other column (verbs, adverb, conjunction, etc.)?

Solution

This is a really wide-open question, and there are many applications and much related research. Let me give some pointers based on software I have used.

One column would be the lexeme, for example "eat." A second column would give the part of speech, which in your data above would be a string or other identifier that shows whether it is a verb, pronoun, noun, adverb or conjunction.

It might make sense to create another table for verb information. For example, tense, aspect and mood might each be separate columns. But these columns would only make sense for verbs. For the nouns table, the columns would include number (singular, plural) and gender, and perhaps whether it is a count or mass noun. Pronouns would also include person (first, second or third person).

Do you plan to include every form of every word? For example, will this database store "eats" and "eating" as well as "jumps" and "jumping?" It is much more efficient to store rules like "-s" for present singular and "-ing" for progressive. Then if there are exceptions, for example "ate," it can be described as having the underlying form of "eat" + "-ed." This rule would go under the "eat" lexeme, and there would be no separate "ate" entry.

Also there are rules such as that plural changes words that end in y to -ies. This would go under the plural noun suffix ("-s"), not individual verbs.

With these things in mind, I offer a more specific answer to your question: No, I do not think this data is best described hierarchically, nor with a tree or graph, but rather analytically and relationally. LibreOffice Base would be a reasonable choice for a fairly simple project of this type, using macros to help with the processing.

So for:

"heiress" -> masculine plural = "heirs"

The first thing to do would be to analyze "heiress" as "heir" + feminine. Then compose the desired wordform by combining "heir" and "-s."

I was going to add a list of related software such as Python NLTK, but for one thing, the list of available software is nearly endless, and for another, software recommendations are off-topic for stackoverflow.