I'd like to write a function `same_base(word1, word2)` that returns `True` when `word1` and `word2` are two English words derived from the same root word. I realize that words can have multiple senses; I want the algorithm to be overzealous, returning `True` whenever it is possible to view the words as originating from the same place. Some false positives are OK; false negatives are not.
Typically, stemming and lemmatization would be used for this. Here's what I've tried:
`sung` and `sing`, `dig` and `dug`, `medication` and `medicine`.
Does such a tool exist? Do I just need an extremely aggressive stemmer / lemmatizer combo, and if so, where would I find one?
The general task, as you've described it, cannot be solved by simple textual analysis of the input characters, because English has no consistent rules for how words change as they evolve. An excellent lemmatiser will handle the straightforward cases for you: those that can be resolved by applying transformations common within a part of speech (such as irregular verb forms).
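To make the "straightforward cases" concrete, here is a minimal sketch of what a lemmatiser does internally: look the word up in a table of irregular forms, and otherwise strip a regular inflectional suffix. The names `IRREGULAR_LEMMAS`, `SUFFIXES`, `toy_lemma` and `same_lemma` are made up for illustration, and the tables are tiny hand-written stand-ins, not the data of any real library.

```python
# Hypothetical, hand-written table of irregular forms. A real lemmatiser
# ships a far larger one; these entries are for illustration only.
IRREGULAR_LEMMAS = {"sung": "sing", "sang": "sing", "dug": "dig"}

# Regular inflectional suffixes, tried in this order (illustrative only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_lemma(word):
    """Return a rough lemma: irregular lookup first, then suffix stripping."""
    word = word.lower()
    if word in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so "sing" is not cut down to "s".
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def same_lemma(word1, word2):
    """True when both words reduce to the same toy lemma."""
    return toy_lemma(word1) == toy_lemma(word2)
```

With these tables, `same_lemma("sung", "sing")` and `same_lemma("dug", "dig")` are true, but `same_lemma("medication", "medicine")` is false: the two words are related by derivation rather than inflection, so this kind of lookup does not connect them.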
However, to eliminate false negatives, you need complete coverage of each word's origins, and complete coverage requires etymology, especially when the root word is not English or does not appear in the shortened word itself.
For instance, what software tool could tell you that `dis` and `speculum` have the same root (`specere`), but that `species` does not? How would you tell that `gentle`, `gentile`, `genteel`, and `jaunty` have the same root? You'll need the etymology to get 100% of the actual connections.
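If you do have etymology data, the comparison itself is simple. The following is only a sketch under stated assumptions: `ROOTS` is a hypothetical hand-built table standing in for a real etymological dictionary, the root label `gentilis` for the `gentle` group is my assumption (the root is not named above), and the `on_unknown` parameter is an invented knob for the "false positives are OK, false negatives are not" requirement.

```python
# Hypothetical etymology table: word -> set of roots it can be traced to.
# In practice this would be loaded from an etymological dictionary.
ROOTS = {
    "dis": {"specere"},
    "speculum": {"specere"},
    # "gentilis" is an assumed label for this group, for illustration.
    "gentle": {"gentilis"},
    "gentile": {"gentilis"},
    "genteel": {"gentilis"},
    "jaunty": {"gentilis"},
}


def same_base(word1, word2, roots=ROOTS, on_unknown=True):
    """Return True when the two words share at least one recorded root.

    If either word is missing from the table, return `on_unknown`.
    The default of True errs towards false positives, never false negatives.
    """
    roots1 = roots.get(word1.lower())
    roots2 = roots.get(word2.lower())
    if roots1 is None or roots2 is None:
        return on_unknown
    # A word may have several candidate roots; one overlap is enough.
    return not roots1.isdisjoint(roots2)
```

Storing a set of roots per word lets a word with several senses or disputed origins match on any of them, which fits the overzealous behaviour asked for. The hard part remains filling the table, not querying it.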