
Data format for Stanford POS-tagger


I am re-training the Stanford POS-tagger on my own data. I have trained two other taggers on the same data in the following one-token-per-line format:

word1_TAG
word2_TAG
word3_TAG
word4_TAG
.

Is this format ok for the Stanford tagger, or does it need to be one-sentence-per-line?

word1_TAG word2_TAG word3_TAG word4_TAG .

Could using the first format for training and testing affect Stanford tagging results?


Solution

  • You should have one sentence per line (your second example).

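    Using the first format will certainly affect tagging results: the tagger treats each line as a separate sentence, so every token becomes its own one-word sentence. You'll effectively build a unigram tagger, in which all tagging is done without any sentence context at all.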
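
If the data already exists in the one-token-per-line format, it could be converted with a short script along the following lines. This is only a sketch: it assumes that a blank line marks the end of each sentence (the question does not say how sentences are delimited), and the function name `tokens_to_sentences` is made up for illustration.

```python
from typing import Iterable, Iterator, List


def tokens_to_sentences(lines: Iterable[str]) -> Iterator[str]:
    """Join one-token-per-line input into one sentence per line.

    Assumption (not stated in the question): a blank line marks the
    end of a sentence. Adjust the boundary test if your data differs.
    """
    current: List[str] = []
    for raw in lines:
        token = raw.strip()
        if token:
            # Still inside a sentence: collect the token as-is (word_TAG).
            current.append(token)
        elif current:
            # Blank line: emit the collected tokens as one line.
            yield " ".join(current)
            current = []
    if current:
        # Emit the last sentence if the input lacks a trailing blank line.
        yield " ".join(current)


# Small in-memory example using the placeholder tokens from the question.
sample = ["word1_TAG", "word2_TAG", ".", "", "word3_TAG", "word4_TAG", "."]
converted = list(tokens_to_sentences(sample))
for sentence in converted:
    print(sentence)
```

For a real file, the same function can be fed an open file handle and its output written line by line to a new training file. Apply the same conversion to the test data so that training and testing use the same format.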