nlp language-model

How to treat <s> and </s> when calculating a unigram LM?


I am a beginner in NLP, and I'm confused about how to treat the <s> and </s> symbols when calculating counts for a unigram model. Should I count them or just ignore them?


Solution

  • If I understand correctly, <s> and </s> are special (fake) unigrams placed before the first and after the last real unigram of each text. In that case there is no need for them in a unigram model: every string contains them, so they provide no additional information.

    Such special unigrams become useful for higher-order n-grams. For example, they let you extract 2 bigrams from a 1-word string like hello: <s> hello and hello </s>; or 3 trigrams: <s0> <s1> hello, <s1> hello </s1>, and hello </s1> </s0>.
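
The padding scheme described above can be sketched in a few lines of Python. This is only an illustration, not a reference implementation: the function name `padded_ngrams` is hypothetical, and the numbered pad symbols (<s0>, <s1>, ...) simply follow the trigram example in this answer. It assumes n - 1 pad symbols are added on each side, so for n = 1 no padding is added at all.

```python
def padded_ngrams(tokens, n):
    """Return the n-grams of `tokens` after adding n - 1 pad symbols per side.

    Hypothetical helper for illustration; the pad naming follows the
    examples in the answer (<s>, </s> for bigrams; <s0>, <s1>, ... otherwise).
    """
    if n < 1:
        raise ValueError("n must be at least 1")

    pads = n - 1
    if pads == 1:
        # Bigram case: a single start and a single end symbol.
        left, right = ["<s>"], ["</s>"]
    else:
        # Unigram case (pads == 0) gives two empty lists, i.e. no padding.
        # Higher orders get numbered pads: <s0> <s1> ... </s1> </s0>.
        left = [f"<s{i}>" for i in range(pads)]
        right = [f"</s{i}>" for i in reversed(range(pads))]

    padded = left + list(tokens) + right
    # Slide a window of width n over the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


print(padded_ngrams(["hello"], 1))  # [('hello',)]
print(padded_ngrams(["hello"], 2))  # [('<s>', 'hello'), ('hello', '</s>')]
print(padded_ngrams(["hello"], 3))
```

With n = 1 the function returns only the real tokens, which matches the point made above: the boundary symbols only start to matter from bigrams upward.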