How can I give the Stanford NLP POS tagger some POS information before it runs?


Suppose I already know the POS tag of some words.

For example, I know that st316 (my ID) is a proper noun (NR). Given the sentence "I am st316.", how can I make the tagger use the information that st316 is NR and then decide the POS tags of the other words (I, am)?

Something like this:

Input: I am st316/NR .

Output: I/PN am/VC st316/NR ./PU

Any help is appreciated. Thanks!


Solution

  • I can think of 2 options:

    1. (easy) Let the tagger do its magic and then overwrite its output. If you know st316 must be tagged as X and Stanford failed to tag it as such, change the tag of st316 to X. The disadvantage of this approach is that the tagger is not able to use that information to better tag the rest of the sentence.
    2. (harder) Retrain the POS tagger, adding the extra information you have to its training data. This way it will actually learn from the information you provide and will be able to make use of it. The drawback is that you will need to obtain some training data and, depending on how much data you get, it may take a while to train a new model.
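
Option 1 can be sketched in a few lines. This is a minimal illustration, assuming the tagger output has already been converted to a list of (word, tag) pairs; the function name `override_tags` and the sample mis-tagged output are hypothetical and not part of the Stanford tagger API.

```python
def override_tags(tagged, known_tags):
    """Return a copy of `tagged` where words listed in `known_tags` get the known tag."""
    return [(word, known_tags.get(word, tag)) for word, tag in tagged]


# Hypothetical tagger output in which st316 was tagged NN instead of NR.
tagged = [("I", "PN"), ("am", "VC"), ("st316", "NN"), (".", "PU")]

fixed = override_tags(tagged, {"st316": "NR"})
print(" ".join(f"{word}/{tag}" for word, tag in fixed))
```

The lookup is an exact, case-sensitive match on the word, so only the tags you list are changed and the tags of the surrounding words stay as the tagger produced them.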

    If you go with option 2, you need to format your data as follows:

    An_DT avocet_NN is_VBZ a_DT small_JJ ,_, cute_JJ bird_NN ._.
    I_PRP am_VBP st316_NNP ._.
    I_PRP am_VBP st316_NNP ._.
    I_PRP am_VBP st316_NNP ._.
    I_PRP am_VBP st316_NNP ._.
    I_PRP am_VBP st316_NNP ._.
    

    The first line is taken from the Stanford FAQ. The rest is your extra knowledge. Note that the extra sentence is repeated in order to add pseudo-counts to that observation. Informally, if you included st316_NNP only once in the training data, chances are the tagger would treat it as noise or an error and ignore it. Repeating it is like saying "Yes, I am sure, I know what I'm doing, learn from this data". Depending on how much data you have, you will need anywhere between 5 and 50 repetitions to ensure the tagger learns properly. The example uses Penn Treebank tags (NNP); if your model uses a different tag set, such as the NR tag from the question, use those tags instead.
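
The repeated lines for option 2 do not have to be written by hand. The following is a rough sketch that builds training lines in the word_TAG format shown above; the helper names `to_training_line` and `build_training_lines` and the default of 5 repetitions are assumptions made for illustration, and the sketch only prepares the text, it does not train a model.

```python
def to_training_line(tagged, sep="_"):
    """Format one sentence of (word, tag) pairs as 'word_TAG word_TAG ...'."""
    return " ".join(f"{word}{sep}{tag}" for word, tag in tagged)


def build_training_lines(base, extra, repetitions=5):
    """Keep each base sentence once and repeat each extra sentence `repetitions` times."""
    lines = [to_training_line(sentence) for sentence in base]
    for sentence in extra:
        lines.extend([to_training_line(sentence)] * repetitions)
    return lines


# Existing training data (here, just the sentence from the Stanford FAQ).
base = [[("An", "DT"), ("avocet", "NN"), ("is", "VBZ"), ("a", "DT"),
         ("small", "JJ"), (",", ","), ("cute", "JJ"), ("bird", "NN"), (".", ".")]]

# Extra knowledge to reinforce through repetition.
extra = [[("I", "PRP"), ("am", "VBP"), ("st316", "NNP"), (".", ".")]]

for line in build_training_lines(base, extra):
    print(line)
```

Raising `repetitions` strengthens the pseudo-count for st316; the right value depends on how large the rest of the training data is.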