
What is the meaning of 'cut-off' and 'iteration' for training in OpenNLP?


What is the meaning of cut-off and iteration for training in OpenNLP, or for natural language processing in general? I just need a layman's explanation of these terms. As I understand it, iteration is the number of times the algorithm is repeated, and cut-off is a value such that if a text scores above it for some specific category, it gets mapped to that category. Am I right?


Solution

  • Correct: the term iteration refers to the general notion of iterative algorithms, in which one solves a problem by successively producing (hopefully increasingly accurate) approximations of some "ideal" solution. Generally speaking, the more iterations, the more accurate ("better") the result, but also the more computational steps that have to be carried out. (A sketch showing how both values are set in OpenNLP follows at the end of this answer.)

    The term cutoff (a.k.a. cutoff frequency) designates a method of reducing the size of n-gram language models (as used by OpenNLP, e.g. in its part-of-speech tagger). Consider the following example:

    Sentence 1 = "The cat likes mice."
    Sentence 2 = "The cat likes fish."
    Bigram model = {"the cat" : 2, "cat likes" : 2, "likes mice" : 1, "likes fish" : 1}
    

    If you set the cutoff frequency to 1 for this example, the n-gram model would be reduced to

    Bigram model = {"the cat" : 2, "cat likes" : 2}
    

    That is, the cutoff method removes from the language model those n-grams that occur infrequently in the training data. Reducing the size of n-gram language models is sometimes necessary, as the number of distinct bigrams (let alone trigrams, 4-grams, etc.) explodes for larger corpora. The remaining information (n-gram counts) can then be used to statistically estimate the probability of a word (or its POS tag) given the (n-1) previous words (or POS tags). A sketch of this counting-and-pruning step is shown below.
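
    In OpenNLP itself, both values are set as ordinary training parameters. Here is a minimal configuration sketch, assuming a recent OpenNLP release (the ITERATIONS_PARAM and CUTOFF_PARAM constants live in opennlp.tools.util.TrainingParameters); the concrete values 100 and 5 are arbitrary illustrations:

    import opennlp.tools.util.TrainingParameters;

    public class TrainingConfig {
        public static void main(String[] args) {
            TrainingParameters params = new TrainingParameters();
            // Number of passes the iterative trainer makes over the training data.
            params.put(TrainingParameters.ITERATIONS_PARAM, "100");
            // Events/features seen fewer than 5 times are discarded before training.
            params.put(TrainingParameters.CUTOFF_PARAM, "5");
            // `params` can then be passed to a trainer, e.g. POSTaggerME.train(...).
        }
    }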
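
    And here is a plain-Java sketch of the counting-and-pruning step itself, using the two example sentences above (no OpenNLP involved; the class and variable names are made up for illustration):

    import java.util.HashMap;
    import java.util.Map;

    public class BigramCutoff {
        public static void main(String[] args) {
            String[][] sentences = {
                {"the", "cat", "likes", "mice"},
                {"the", "cat", "likes", "fish"}
            };
            int cutoff = 1; // bigrams occurring this often or less are dropped

            // Count every adjacent token pair within each sentence.
            Map<String, Integer> counts = new HashMap<>();
            for (String[] sentence : sentences) {
                for (int i = 0; i < sentence.length - 1; i++) {
                    counts.merge(sentence[i] + " " + sentence[i + 1], 1, Integer::sum);
                }
            }

            // Apply the cutoff: only "the cat"=2 and "cat likes"=2 survive.
            counts.values().removeIf(c -> c <= cutoff);
            System.out.println(counts);
        }
    }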