Recently I've been trying to build a service for lemmatization of German words.
I found a very good article here.
After completing all the steps described in the article, my service works quite well, but while testing I noticed that some verbs cannot be converted to their infinitive form.
E.g. kochst -> kochen. The root cause is that my POS tagger gives me ADV for 'kochst', while it should be VVFIN, or at least V..., since this is a verb.
I also found that the original TIGER corpus file doesn't contain the form 'kochst', only 'kocht'.
I am not familiar with the CoNLL format, but I added one additional row, shown below:
50475_11 kochst kochen _ VVFIN _ number=sg|person=2|tense=pres|mood=ind _ 0 _ -- _ _ _ _
and retrained the tagger, without any success; see the listing below:
>>> import nltk
>>> corp = nltk.corpus.ConllCorpusReader('.', 'tiger_release_aug07.corrected.16012013.conll09',
... ['ignore', 'words', 'ignore', 'ignore', 'pos'],
... encoding='utf-8')
>>>
>>> tagged_sents = corp.tagged_sents()
>>>
>>> from ClassifierBasedGermanTagger.ClassifierBasedGermanTagger import ClassifierBasedGermanTagger
>>> tagger = ClassifierBasedGermanTagger(train=tagged_sents)
>>> tagger.tag(['kochst'])
[('kochst', u'ADV')]
>>>
>>>
>>> tagged_sents[-1]
[(u'kochst', u'VVFIN')]
So either I added the 'kochst' record incorrectly, or the TIGER corpus is incomplete (I found that many second-person verb forms are missing), or I simply don't understand something about how to train the POS tagger to return a verb tag for conjugated verbs.
'kochst' is just an example; I guess there are a lot of other verbs that cannot be recognized:
>>> tagger.tag(['fahre'])
[('fahre', u'XY')]
>>> tagger.tag(['musst'])
[('musst', u'PPER')]
TIGER only contains newspaper text, so there aren't many non-third-person verb forms in it. A statistical model is not going to learn much about verb endings it has barely seen.
Things that could help:
Choose a better tagger. The one you mentioned has a somewhat limited feature set, especially in terms of prefixes and suffixes. I'm not familiar with all the options in NLTK (there may be some that are equally good), but as an alternative I'd suggest trying MarMoT for tagging plus Lemming for lemmatization, from http://cistern.cis.lmu.de; they are relatively fast and easy to use. There are also plenty of newer tagging approaches that may be a little better, but it's hard to tell how they compare, because many recent evaluations are based on the UD German corpus, which unfortunately has relatively low-quality annotation.
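For instance, still within NLTK, you can give a classifier-based tagger richer affix features via a custom feature detector. This is only a sketch with a made-up toy training set (in your case you would pass the real TIGER tagged_sents); the feature names are my own choices:

```python
from nltk.tag.sequential import ClassifierBasedTagger

def affix_features(tokens, index, history):
    """Feature detector with explicit prefix/suffix features."""
    word = tokens[index]
    return {
        'word': word.lower(),
        'suffix2': word[-2:],   # e.g. '-st' for 2nd person singular
        'suffix3': word[-3:],
        'prefix2': word[:2],
        'prev_tag': history[-1] if history else '<START>',
    }

# Toy training sentences (hypothetical; use the TIGER sentences instead)
train = [
    [('du', 'PPER'), ('kochst', 'VVFIN'), ('gern', 'ADV')],
    [('du', 'PPER'), ('fragst', 'VVFIN'), ('oft', 'ADV')],
    [('ich', 'PPER'), ('koche', 'VVFIN'), ('heute', 'ADV')],
]

tagger = ClassifierBasedTagger(feature_detector=affix_features, train=train)
print(tagger.tag(['du', 'kochst', 'gern']))
```

With enough second-person sentences in the training data, the suffix features give the classifier a chance to generalize the '-st' ending to verbs it has never seen.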
Taggers rely on context, so when you add new training data it helps to add whole sentences, or at least whole phrases.
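A sketch of what that could look like for your case: instead of the single 'kochst' row, append a complete sentence in the same 15-column layout as the row you added. The sentence, its token ids, and the filename here are made up:

```python
# Append a complete sentence (not an isolated token) using the same
# 15-column layout as the TIGER CoNLL file. Token ids and the sentence
# itself are hypothetical examples.
extra = [
    ("50476_1", "Du",     "du",     "PPER",  "case=nom|number=sg|person=2"),
    ("50476_2", "kochst", "kochen", "VVFIN", "number=sg|person=2|tense=pres|mood=ind"),
    ("50476_3", "gern",   "gern",   "ADV",   "_"),
    ("50476_4", ".",      ".",      "$.",    "_"),
]

with open("extra_sentences.conll", "w", encoding="utf-8") as f:
    for tok_id, form, lemma, pos, feats in extra:
        cols = [tok_id, form, lemma, "_", pos, "_", feats, "_",
                "0", "_", "--", "_", "_", "_", "_"]
        f.write("\t".join(cols) + "\n")
    f.write("\n")  # blank line terminates the sentence
```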
Even a large manually annotated corpus isn't going to cover plenty of word forms, so lexical resources are very helpful for lemmatization. I'd take a look at Zmorge, a morphological analyzer based on data from the German Wiktionary. If your main goal is lemmatization, I'd recommend starting with something like Zmorge and backing off to statistical models for ambiguous or unseen words.
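A minimal sketch of that backoff idea, assuming a form-to-lemma lexicon (here a toy dict; in practice you would generate the entries with an analyzer like Zmorge) and crude, hypothetical suffix rules as the fallback:

```python
# Toy lexicon standing in for analyzer output (e.g. from Zmorge).
LEXICON = {
    "kochst": "kochen",
    "fahre": "fahren",
    "musst": "müssen",
}

# Hypothetical fallback rules: strip a personal ending, add the
# infinitive ending "-en". Real rules would need to be far richer.
SUFFIX_RULES = [("st", "en"), ("e", "en"), ("t", "en")]

def lemmatize(word):
    # 1) lexicon lookup first
    if word in LEXICON:
        return LEXICON[word]
    # 2) back off to suffix rules for unseen forms
    for old, new in SUFFIX_RULES:
        if word.endswith(old) and len(word) > len(old) + 2:
            return word[: -len(old)] + new
    # 3) give up and return the form unchanged
    return word

print(lemmatize("kochst"))  # lexicon hit
print(lemmatize("lachst"))  # rule fallback
```

The same structure lets you swap in a statistical lemmatizer as the second stage instead of hand-written rules.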