I'm trying to use the Stanford POS tagger to tag some French text. To do that, I use the following command:
cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/french.tagger -sentenceDelimiter newline > output.txt
(There is one sentence per line.)
However, the tags are quite poor, and the real problem seems to be the tokenization rather than the tagging model: I suspect the text is being tokenized by an English tokenizer.
So I tried running only the French tokenizer on the text:
cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.international.french.process.FrenchTokenizer -sentenceDelimiter newline > tokenized.txt
With that, the French tokens come out correctly.
How can I tell the tagger to use the French model for tagging, but also the French tokenizer at the same time?
You can use the -tokenizerFactory and -tokenizerOptions flags to control tokenization. The "Tagging and Testing from the command line" section of the javadoc for MaxentTagger has a complete list of available options.
I believe the following command will do what you want:
java -mx10000m -cp 'stanford-postagger.jar:' \
edu.stanford.nlp.tagger.maxent.MaxentTagger \
-model models/french.tagger \
-tokenizerFactory 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory' \
-sentenceDelimiter newline
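Note that the single quotes around the factory class name matter: without them the shell would try to expand $FrenchTokenizerFactory as a variable. If you run this often, the command above could be wrapped in a small shell function. This is only a sketch: the names tag_french and TOKENIZER_FACTORY are made up for illustration, the jar and model paths are the ones from the question, and it passes the input with the tagger's -textFile option instead of piping it through cat.

```shell
# Hypothetical wrapper around the tagger command; adjust paths to your setup.
# Single quotes keep the shell from expanding $FrenchTokenizerFactory.
TOKENIZER_FACTORY='edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory'

tag_french() {
    # $1: input file with one sentence per line; tagged text goes to stdout.
    java -mx10000m -cp 'stanford-postagger.jar:' \
        edu.stanford.nlp.tagger.maxent.MaxentTagger \
        -model models/french.tagger \
        -tokenizerFactory "$TOKENIZER_FACTORY" \
        -sentenceDelimiter newline \
        -textFile "$1"
}

# Usage (assumes the jar and model are in the current directory layout):
#   tag_french file.txt > output.txt
```

Inside the function the class name is double-quoted as a variable, which is safe because the $ is already part of the stored value and is not expanded a second time.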