Tags: python, nlp, nltk, pos-tagger

How can I POS tag German texts?


I've been doing some natural language processing work.

For English POS tagging this is simple, because NLTK's built-in functions are all I need. I want to process German texts the same way.

Since NLTK doesn't ship a built-in tagger for German, I tried the Stanford POS Tagger through NLTK's wrapper:

import os

import nltk
from nltk.tag.stanford import StanfordPOSTagger

# Tell NLTK where to find the Java executable.
java_path = "C:/Program Files/Java/jdk1.8.0_71/bin/java.exe"
os.environ['JAVAHOME'] = java_path

sentence = "Man könnte Klöckner vorhalten, sich an ihre eigenen Appelle nicht zu halten. Doch niemand in der Union wagte das. Nicht einmal die von ihr attackierten Briefschreiber. Klöckner genießt im Moment Narrenfreiheit."

# Tokenize with the German tokenizer model, then tag with the Stanford German model.
tokens = nltk.word_tokenize(sentence, language='german')
german_postagger1 = StanfordPOSTagger(r'E:/python/nlptest/models/german-hgc.tagger', r'E:/python/nlptest/stanford-postagger.jar')
gp1 = german_postagger1.tag(tokens)

Tagging these four sentences takes almost 7 seconds, which is far too slow for my purposes.

I also tried the Pattern module, but it doesn't support Python 3, and I'm using Python 3.4.

Is there an alternative and faster way to POS tag German sentences?


Solution

  • TreeTagger is a fast, easy-to-install, well-documented tagger based on decision trees. It supports many languages (and yeah, it's built by a German), and a Python wrapper is available for it.
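
As a rough illustration of what this could look like from Python, here is a minimal sketch. It assumes the third-party `treetaggerwrapper` package is the wrapper in question (the answer does not name one), that the TreeTagger binary and its German parameter file are already installed where the wrapper can find them, and that the wrapper returns one tab-separated `word<TAB>tag<TAB>lemma` string per token. The names `TaggedToken`, `parse_treetagger_lines` and `tag_german` are made up for this example.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str
    pos: str
    lemma: str


def parse_treetagger_lines(lines: List[str]) -> List[TaggedToken]:
    """Split tab-separated "word<TAB>pos<TAB>lemma" lines into tuples."""
    tagged = []
    for line in lines:
        parts = line.split("\t")
        # Skip anything that is not a plain three-column token line.
        if len(parts) == 3:
            tagged.append(TaggedToken(*parts))
    return tagged


def tag_german(text: str) -> List[TaggedToken]:
    """Tag German text with TreeTagger (needs TreeTagger installed)."""
    # Assumption: the treetaggerwrapper package is installed. It is
    # imported lazily so the parser above works without it.
    import treetaggerwrapper

    # Assumption: TAGLANG="de" selects the German parameter file.
    tagger = treetaggerwrapper.TreeTagger(TAGLANG="de")
    return parse_treetagger_lines(tagger.tag_text(text))
```

Calling `tag_german(sentence)` on the text from the question should give a list of `(word, pos, lemma)` tuples. If you tag many texts, it is probably worth creating the tagger object once and reusing it rather than building it on every call.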