python nlp nltk stemming snowball-stemmer

German stemmer is not removing feminine suffixes "-in" and "-innen"


In German, job titles have a masculine and a feminine form. The feminine form is derived from the masculine one by adding the suffix "-in", which becomes "-innen" in the plural.

Example:

          | English          | German
----------+------------------+-----------------------
masc. sg. | teacher  doctor  | Lehrer      Arzt
fem. sg.  | teacher  doctor  | Lehrerin    Ärztin
masc. pl. | teachers doctors | Lehrer      Ärzte
fem. pl.  | teachers doctors | Lehrerinnen Ärztinnen

Currently, I'm using NLTK's nltk.stem.snowball.GermanStemmer. It returns these stems:

Lehrer      -> lehr      | Arzt      -> arzt
Lehrerin    -> lehrerin  | Ärztin    -> arztin
Lehrer      -> lehr      | Ärzte     -> arzt
Lehrerinnen -> lehrerinn | Ärztinnen -> arztinn

Is there a way to make this stemmer return the same stems for all four versions, feminine and masculine ones? Alternatively, is there any other stemmer doing that?

Update

I ended up adding "innen" and "in" as the first entries in the step 1 suffix tuple, like so:

from nltk.stem.snowball import GermanStemmer

stemmer = GermanStemmer()
stemmer._GermanStemmer__step1_suffixes = ("innen", "in") + stemmer._GermanStemmer__step1_suffixes

This way, all of the words above are stemmed to lehr and arzt, respectively. All other job titles I have tried so far are also stemmed consistently, meaning the masculine and feminine forms share a stem. When a job title is derived from a verb, as with Lehrer/in, it also shares its stem with the verb.
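
To check more job titles systematically, it may help to group word forms by the stem they produce. The sketch below is only an illustration: `group_by_stem` is a hypothetical helper (not part of NLTK), and the lookup table stands in for the unpatched stemmer by reusing the stems listed earlier in this question.

```python
from collections import defaultdict


def group_by_stem(stem, words):
    """Map each stem to the list of input words that produced it."""
    groups = defaultdict(list)
    for word in words:
        groups[stem(word)].append(word)
    return dict(groups)


# Stand-in for the unpatched stemmer, built from the stems quoted above.
UNPATCHED = {
    "Lehrer": "lehr",
    "Lehrerin": "lehrerin",
    "Lehrerinnen": "lehrerinn",
}

groups = group_by_stem(UNPATCHED.get, ["Lehrer", "Lehrerin", "Lehrerinnen"])
print(groups)  # three separate groups, i.e. the forms do not share a stem
```

With a real stemmer, passing `stemmer.stem` as the first argument should work the same way; a single group per job title means all forms were conflated.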


Solution

  • The German Snowball stemmer follows a three-step process:

    1. Remove ern, em, er, en, es, e, s suffixes
    2. Remove est, en, er, st suffixes
    3. Remove isch, lich, heit, keit, end, ung, ig, ik suffixes

    I don't know much about German grammar, but "in" seems to belong to the same class as the step 3 suffixes (the NLTK source calls these "derivational suffixes"). Adding "in" to that list would seem to make the Snowball stemmer remove it, but there are two problems.

    The first problem is that, judging from your examples, "in" becomes "inn" when followed by "en". You could work around this by adding both "in" and "inn" to the step 3 suffixes, but that doesn't solve the second problem.

    The second problem is that, according to the GermanStemmer.stem() source, each step removes at most one suffix. So if a word has more than one derivational suffix (i.e. "in" plus any of the step 3 suffixes listed above), only one of them will be removed.

    In such cases (and I don't know enough German to say whether this can actually happen), you'd need to edit GermanStemmer.stem() by hand and add a fourth step that removes "in". This would also give you finer control over plurals. Honestly, though, at that point it's probably simpler to strip "in" ad hoc by wrapping your GermanStemmer.stem() call. For example:

    from nltk.stem.snowball import GermanStemmer
    
    _stemmer = GermanStemmer()  # build once and reuse across calls
    
    def stem_german(word):
        # A word ending in "en" is treated as plural, so after stemming
        # the feminine suffix shows up as "inn" rather than "in".
        plural = word.endswith("en")
        stemmed_word = _stemmer.stem(word)
    
        feminine_suffix = "inn" if plural else "in"
        if stemmed_word.endswith(feminine_suffix):
            stemmed_word = stemmed_word[:-len(feminine_suffix)]
    
        return stemmed_word
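
The one-suffix-per-step behaviour described above can be illustrated with a toy model. This is a simplified sketch, not NLTK's implementation: it reuses the three suffix lists from this answer but ignores the R1/R2 region checks, the special-case conditions and the umlaut handling of the real stemmer, and the names `strip_one` and `toy_stem` are made up for illustration.

```python
STEP1 = ("ern", "em", "er", "en", "es", "e", "s")
STEP2 = ("est", "en", "er", "st")
STEP3 = ("isch", "lich", "heit", "keit", "end", "ung", "ig", "ik")


def strip_one(word, suffixes):
    """Remove the first matching suffix, and at most one."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def toy_stem(word, step3=STEP3):
    """Apply the three steps in order, one suffix removal per step."""
    word = word.lower()
    for suffixes in (STEP1, STEP2, step3):
        word = strip_one(word, suffixes)
    return word


print(toy_stem("Lehrer"))       # lehr
print(toy_stem("Lehrerin"))     # lehrerin
print(toy_stem("Lehrerinnen"))  # lehrerinn

# Adding "in" to step 3 only: the "er" in front of "in" is never reached,
# and "inn" is not matched at all.
with_in = STEP3 + ("in",)
print(toy_stem("Lehrerin", with_in))     # lehrer
print(toy_stem("Lehrerinnen", with_in))  # lehrerinn
```

Even in this reduced form, the three words end up with different stems, which is why a single extra suffix in step 3 is unlikely to be enough.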
    

    --Edit--

    If you want to add "in" to one of the Snowball stemmer's steps, you can do so like this:

    from nltk.stem.snowball import GermanStemmer, SnowballStemmer
    
    #Using nltk.stem.snowball.SnowballStemmer
    stemmer = SnowballStemmer("german")
    stemmer.stemmer._GermanStemmer__step3_suffixes += ("in",) #add "in" to the step 3 suffixes
    
    #Using nltk.stem.snowball.GermanStemmer
    stemmer = GermanStemmer()
    stemmer._GermanStemmer__step3_suffixes += ("in",)
    

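    Note the comma after "in". Without it, ("in") is just a string rather than a one-element tuple, and the += fails with a TypeError. You can also replace the 3 with whichever step you wish to modify. The attribute is called _GermanStemmer__step3_suffixes rather than just __step3_suffixes because Python mangles the names of double-underscore class attributes by prefixing them with the class name. I've verified that this code works on Python 3.6.4 and NLTK 3.2.5.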
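
A minimal sketch of how Python's name mangling produces attribute names of this shape. `ToyStemmer` is a hypothetical stand-in written for this example; it is not NLTK's class and its suffix list is made up.

```python
class ToyStemmer:
    # A double-underscore class attribute is stored under the mangled
    # name _ToyStemmer__suffixes.
    __suffixes = ("en", "er")

    def stem(self, word):
        # Inside the class body, self.__suffixes is compiled as
        # self._ToyStemmer__suffixes.
        for suffix in self.__suffixes:
            if word.endswith(suffix):
                return word[: -len(suffix)]
        return word


stemmer = ToyStemmer()
before = stemmer.stem("lehrerin")

# From outside the class, the mangled name has to be spelled out.
# The += builds a new tuple and stores it on this instance only;
# the class-level tuple is left untouched.
stemmer._ToyStemmer__suffixes += ("in",)
after = stemmer.stem("lehrerin")

print(before, after)                    # lehrerin lehrer
print(ToyStemmer._ToyStemmer__suffixes) # ('en', 'er')
```

One consequence of the instance-level assignment is that other stemmer objects created later keep the original suffix list.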

    I would not recommend this approach, though, as it will not deal properly with "innen". Also, since each step removes at most one suffix, it will not deal properly with words like Lehrerinnen, which contain "en", "in" and "er" (step 3 doesn't check for "er"). I think your best bet is to copy the entire GermanStemmer class (found via the source code link above; use Ctrl+F) and add a step 2.5 to stem() that checks for and removes "in"/"inn".