Bad tokenization in the Stanford POS tagger

I'm trying to use the Stanford POS tagger to tag some French text. To do that, I use the following command:

cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/french.tagger -sentenceDelimiter newline > output.txt

(There is one sentence per line.)

However, the tags are pretty bad, and the real issue seems to be the tokenization of the French text rather than the tagging model. I suspect the tokenization is being done by an English tokenizer.

So I tried running only the French tokenizer on the text:

cat file.txt | java -mx10000m -cp 'stanford-postagger.jar:' edu.stanford.nlp.international.french.process.FrenchTokenizer -sentenceDelimiter newline > tokenized.txt

With that command, the French tokens come out correctly.

How can I tell the tagger to use the French tokenizer together with the French model?

Solution

You can use the -tokenizerFactory and -tokenizerOptions flags to control tokenization. The "Tagging and Testing from the command line" section of the javadoc for MaxentTagger has a complete list of available options.

I believe the following command will do what you want:

java -mx10000m -cp 'stanford-postagger.jar:' \
  edu.stanford.nlp.tagger.maxent.MaxentTagger \
  -model models/french.tagger \
  -tokenizerFactory 'edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory' \
  -sentenceDelimiter newline
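
One detail in that command is easy to get wrong: the factory is an inner class, so its name contains a literal "$". In single quotes the shell passes it through unchanged; in double quotes (or unquoted) the shell would try to expand $FrenchTokenizerFactory as a variable and the class name would be cut short. Below is a minimal sketch of a wrapper that keeps the name in a single-quoted variable and reads from a file, as in the question. The function name tag_french and its two arguments are hypothetical, not part of the Stanford distribution; the jar and model paths are the ones from the question and are assumed to be relative to the current directory.

```shell
#!/bin/sh
# Inner class name of the French tokenizer factory.
# Single quotes keep the "$" literal, so Java receives the full class name.
TOKENIZER_FACTORY='edu.stanford.nlp.international.french.process.FrenchTokenizer$FrenchTokenizerFactory'

# Hypothetical helper: tag_french INPUT OUTPUT
# Reads one sentence per line from INPUT and writes the tagged text to OUTPUT.
# Assumes stanford-postagger.jar and models/french.tagger exist in the
# current directory, as in the question.
tag_french() {
  java -mx10000m -cp 'stanford-postagger.jar:' \
    edu.stanford.nlp.tagger.maxent.MaxentTagger \
    -model models/french.tagger \
    -tokenizerFactory "$TOKENIZER_FACTORY" \
    -sentenceDelimiter newline < "$1" > "$2"
}
```

After sourcing the snippet, something like `tag_french file.txt output.txt` should be equivalent to the piped command in the question with the French tokenizer added. Expanding "$TOKENIZER_FACTORY" inside double quotes is safe here, because the shell does not re-expand the "$" contained in the variable's value.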