
Grouping words by their similarity


I have a huge dictionary/dataframe of German words and how often they appeared in a huge text corpus. For example:

der                                23245
die                                23599
das                                23959
eine                               22000
dass                               18095
Buch                               15988
Büchern                             1000
Arbeitsplatz-Management              949
Arbeitsplatz-Versicherung            800

Since words like "Buch" (book) and "Büchern" (books, in a different declension form) have similar meanings, I want to add up their frequencies. The same goes for the articles "der", "die" and "das", but not for the last two words, which have completely different meanings even though they share the same first component.

I tried the Levenshtein distance, which is "the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other." But I get a larger Levenshtein distance between "Buch" and "Büchern" than between "das" and "dass", which have completely different meanings:

import enchant
string1 = "das"
string2 = "dass"
string3 = "Buch"
string4 = "Büchern"
print(enchant.utils.levenshtein(string1, string2))
print(enchant.utils.levenshtein(string3, string4))
# Output:
# 1
# 4
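
For reference, the two distances can be reproduced without pyenchant. The following is a minimal sketch of the standard dynamic-programming formulation of the Levenshtein distance; the function name `levenshtein` is our own choice here and is not meant to mirror the library's implementation.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] is the distance between the already processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("das", "dass"))      # one insertion
print(levenshtein("Buch", "Büchern"))  # one substitution plus three insertions
```

The distance only looks at characters, so a long inflectional ending costs more than a single letter that changes the meaning entirely.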

Is there any other way to cluster such words efficiently?


Solution

  • Buch and Büchern are the easy case, since they are just different inflected forms of the same word. Every inflected form maps to a single dictionary form, called the lemma. As it happens, der, die and das are also just inflected forms of one lemma, der. So instead of comparing spellings, we only need to count the dictionary forms (the lemmas), which also keeps unrelated look-alikes such as das and dass apart. spaCy offers an easy way to access the lemma of a word, for example:

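    import spacy
    from collections import Counter
    
    # spaCy 3 removed the 'de' shortcut; load a German pipeline by its full name
    # (install it first with: python -m spacy download de_core_news_sm).
    nlp = spacy.load('de_core_news_sm')
    words = ['der', 'die', 'das', 'eine', 'dass', 'Buch', 'Büchern', 'Arbeitsplatz-Management', 'Arbeitsplatz-Versicherung']
    # Lemmatize each word on its own and count how many words share each lemma.
    lemmas = [nlp(a)[0].lemma_ for a in words]
    counter = Counter(lemmas)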
    

    This results in the following counter (the exact lemmas can differ between model versions):

    Counter({'der': 3, 'einen': 1, 'dass': 1, 'Buch': 2, 'Arbeitsplatz-Management': 1, 'Arbeitsplatz-Versicherung': 1})
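
The counter above counts how many words share a lemma. To add up the corpus frequencies from the question instead, the same idea can be sketched as follows. This is only an illustration: `sum_by_lemma`, `toy_lemmatize` and the `TOY_LEMMAS` table are hypothetical names, and the small table stands in for a real lemmatizer so that the example runs without a language model.

```python
from collections import Counter


def sum_by_lemma(frequencies, lemmatize):
    """Add up word frequencies per lemma; lemmatize maps a word to its lemma."""
    totals = Counter()
    for word, count in frequencies.items():
        totals[lemmatize(word)] += count
    return totals


# Assumption: a hand-written lemma table used only for this illustration.
TOY_LEMMAS = {"die": "der", "das": "der", "Büchern": "Buch"}


def toy_lemmatize(word):
    # Words missing from the table are treated as their own lemma.
    return TOY_LEMMAS.get(word, word)


frequencies = {
    "der": 23245,
    "die": 23599,
    "das": 23959,
    "eine": 22000,
    "dass": 18095,
    "Buch": 15988,
    "Büchern": 1000,
    "Arbeitsplatz-Management": 949,
    "Arbeitsplatz-Versicherung": 800,
}

totals = sum_by_lemma(frequencies, toy_lemmatize)
print(totals["der"])   # 23245 + 23599 + 23959
print(totals["Buch"])  # 15988 + 1000
```

With spaCy, one would presumably pass something like `lambda w: nlp(w)[0].lemma_` as the `lemmatize` argument in place of `toy_lemmatize`.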