I've recently begun working on a sentiment analysis project on German texts and I'm planning on using a stemmer to improve the results.
NLTK comes with a German Snowball Stemmer and I've already tried it, but I'm unsure about the results. Maybe this is how it is supposed to work, but as a computer scientist and not a linguist, I have a problem with inflected forms of the same verb being reduced to different stems.
Take the verb "suchen" (to search): the 1st person singular "suche" is stemmed to "such", but the 3rd person singular "sucht" is stemmed to "sucht".
I know there is also lemmatization, but as far as I know no working German lemmatizer is integrated into NLTK. There is GermaNet, but its NLTK integration seems to have been abandoned.
Getting to the point: I would like inflected verb forms to be stemmed to the same stem, at the very least for regular verbs within the same tense. If this is not a useful requirement for my goal, please tell me why. If it is, do you know of any additional resources to use which can help me achieve this goal?
Edit: I forgot to mention, any software should be free to use for educational and research purposes.
As a computer scientist, you are definitely looking in the right direction to tackle this linguistic issue ;). Stemming is a fairly crude technique: it is mostly used in Information Retrieval (IR) to shrink the lexicon, and it is usually not sufficient for more sophisticated linguistic analysis. Lemmatisation partly overlaps with the use case for stemming, but it also maps all inflections of a verb, for example, to the same root form (the lemma), and it can distinguish "work" as a noun from "work" as a verb (although this depends on the implementation and quality of the lemmatiser). To do this it usually needs more information, such as POS tags or syntax trees, so it takes considerably longer, which makes it less suitable for IR tasks that typically deal with large amounts of data.
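To make the role of POS information concrete, here is a minimal sketch of a lookup-based lemmatiser. The lexicon, the tag names and the `lemmatise` function are all made up for illustration; a real lemmatiser would rely on a full morphological lexicon or a trained model rather than a hand-written table.

```python
# Hypothetical toy lexicon: (lowercased form, coarse POS tag) -> lemma.
# The entries and tag names are assumptions chosen for this example only.
LEMMAS = {
    ("suche", "VERB"): "suchen",
    ("sucht", "VERB"): "suchen",
    ("suchen", "VERB"): "suchen",
    ("sucht", "NOUN"): "Sucht",  # "die Sucht" (addiction)
    ("suche", "NOUN"): "Suche",  # "die Suche" (the search)
}


def lemmatise(form, pos):
    """Return the lemma for a (form, POS) pair, or the form itself if unknown."""
    return LEMMAS.get((form.lower(), pos), form)


print(lemmatise("sucht", "VERB"))     # suchen
print(lemmatise("Sucht", "NOUN"))     # Sucht
print(lemmatise("Beispiel", "NOUN"))  # Beispiel (not in the toy lexicon)
```

The same surface form ends up with a different lemma depending on its tag, which is exactly the information a plain stemmer does not have.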
In addition to GermaNet (I didn't know its NLTK integration had been abandoned, and I never really tried it: it is free, but you have to sign an agreement to get access), there is spaCy, which you could have a look at: https://spacy.io/docs/usage/
It is very easy to install and use. Follow the install instructions on the website, then download the German model using:
python -m spacy download de
then:
>>> import spacy
>>> nlp = spacy.load('de')
>>> doc = nlp('Wir suchen ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
...
Wir 521 wir
suchen 1162 suchen
ein 486 ein
Beispiel 809 Beispiel
>>> doc = nlp('Er sucht ein Beispiel')
>>> for token in doc:
...     print(token, token.lemma, token.lemma_)
...
Er 513 er
sucht 1901 sucht
ein 486 ein
Beispiel 809 Beispiel
As you can see, it unfortunately doesn't do a very good job on your specific example: "sucht" is left as "sucht" rather than being mapped to "suchen". The number is token.lemma, the integer ID that spaCy uses internally for the lemma string shown by token.lemma_; I'm not sure what other information can be obtained from it. Still, maybe you can give it a go and see if it helps you.
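If neither tool gives you consistent results for regular verbs, one possible stopgap is a small rule of your own. The following is only a sketch of that idea, not something NLTK or spaCy provides: `naive_verb_stem`, the ending list and the `min_stem_len` parameter are all assumptions made for this example. It strips regular present-tense endings and should only be applied to tokens you already know are verbs (for instance via POS tags), because it would also mangle nouns such as "Nacht".

```python
# Regular present-tense endings, longest first so "-est" is tried before "-st" and "-t".
PRESENT_ENDINGS = ("est", "en", "st", "et", "e", "t")


def naive_verb_stem(verb_form, min_stem_len=3):
    """Strip one regular present-tense ending from a (known) verb form.

    Deliberately naive: it ignores irregular verbs, other tenses, and
    stems ending in -s (e.g. "reist" vs. "reisen" will not match).
    """
    word = verb_form.lower()
    for ending in PRESENT_ENDINGS:
        # Keep at least min_stem_len characters so short words stay intact.
        if word.endswith(ending) and len(word) - len(ending) >= min_stem_len:
            return word[: -len(ending)]
    return word


forms = ["suche", "suchst", "sucht", "suchen"]
print({form: naive_verb_stem(form) for form in forms})
```

All four forms of "suchen" come out as "such" here, which is the behaviour asked for in the question, but the limitations above mean a proper lemmatiser is still the better long-term option.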