Search code examples
nlppos-taggeroov

Part of speech tagging : tagging unknown words


In the part of speech tagger, the best probable tags for the given sentence is determined using HMM by

    P(T*) = argmax P(Word/Tag)*P(Tag/TagPrev)
              T

But when 'Word' did not appear in the training corpus, P(Word/Tag) produces ZERO for given all possible tags, this leaves no room for choosing the best.

I have tried few ways,

1) Assigning small amount of probability for all unknown words, P(UnknownWord/AnyTag)~Epsilon... means this completely ignores the P(Word/Tag) for unknowns word by assigning the constant probability.. So decision making on unknown word is by prior probabilities.. As expected it is not producing good result.

2) Laplace Smoothing I confused with this. I don't know what is difference between (1) and this. My way of understanding Laplace Smoothing adds the constant probability(lambda) to all unknown & Known words.. So the All Unknown words will get constant probability(fraction of lambda) and Known words probabilities will be the same relatively since all word's prob increased by Lambda. Is the Laplace Smoothing same as the previous one ?

*)Is there any better way of dealing with unknown words ?


Solution

  • Your two approaches are similar, but, if I understand correctly, they differ in one key way. In (1) you are assigning extra mass to counts of unknown words and in (2) you are assigning extra mass to all counts. You definitely want to do (2) and not (1).

    One of the problems with Laplace smoothing is that it give too much of a boost to unknown words and drags down the probabilities of high-probability words too much (relatively speaking). Your version (1) would actually worsen this problem. Basically, it would over-smooth.

    Laplace smoothing words ok for an HMM, but it's not great. Most people do add-one smoothing but you could experiment with things like add-one-half or whatever.

    If you want to move beyond this naive approach to smoothing, check out "one-count smoothing", as described in the Appendix of Jason Eisner's HMM tutorial. The basic idea here is that for unknown words more probability mass should be given to tags that appear with a wider variety of low frequency words. For example, since the tag NOUN appears on a large number of different words and DETERMINER appears on a small number of different words, it is more likely that an unseen word will be a NOUN.

    If you want to get even fancier, you could use a Chinese Restaurant Process model taken from non-parametric Bayesian statistics to put a prior distribution on unseen word/tag combinations. Kevin Knight's Bayesian inference tutorial has details.