python nlp nltk n-gram

Ngrams with non-symmetrical padding in NLTK


The quad-grams (4-grams) of the word TEXT, as generated by NLTK, are:

>>> from nltk import ngrams
>>> generated_ngrams = ngrams('TEXT', 4, pad_left=True, pad_right=True, left_pad_symbol=' ', right_pad_symbol=' ')
>>> list(generated_ngrams)
[(' ', ' ', ' ', 'T'), (' ', ' ', 'T', 'E'), (' ', 'T', 'E', 'X'), ('T', 'E', 'X', 'T'), ('E', 'X', 'T', ' '), ('X', 'T', ' ', ' '), ('T', ' ', ' ', ' ')]

I expected the output to be _TEX, TEXT, EXT__, XT__. According to this website (http://cloudmark.github.io/Language-Detection/), the output is _TEX, TEXT, EXT_, XT__, T___.
It also goes on to say: "In general a string of length k, padded with blanks, will have k+1 bi-grams, k+1 tri-grams, k+1 quad-grams and so on."
The output I got from Python has 7 quad-grams for a string of length 4, not k+1 = 5, so I don't think that statement is valid.
Could someone explain the discrepancy?


Solution

  • Padding ensures that each symbol of the actual string occurs at every position of the n-gram. So for 4-grams, the last symbol T appears in three right-padded n-grams, E X T _, X T _ _, and T _ _ _ (and likewise the first symbol in three left-padded ones), as your output shows.

    The website you link to adds a single space on the left, then pads fully on the right. That's why the counts differ: its scheme yields the same number of n-grams for every n-gram size. This is the corresponding Python code:

    from nltk import ngrams
    # One explicit leading space, no left padding, full padding on the right.
    generated_ngrams = ngrams(" " + 'TEXT', 4,
                              pad_left=False, pad_right=True, right_pad_symbol=' ')
    

    Why it was done this way, only the author of the blog really knows. But one consequence of padding on the right but not on the left is that, as the blog points out, a string of length k produces a fixed number of n-grams (k+1) for any n-gram size n. The initial space is not what keeps the count constant (it only raises it from k to k+1); rather, it serves as a word-boundary marker: n-grams that start with a space are word-initial.
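
To see the two counts side by side, here is a minimal sketch in plain Python. The helper `padded_ngrams` is hypothetical (it is not part of NLTK); it only mimics the padding arithmetic, assuming that symmetric padding adds n - 1 pad symbols on each side and that the blog's scheme adds one leading space plus n - 1 pad symbols on the right.

```python
def padded_ngrams(seq, n, left=0, right=0, pad=" "):
    """Hypothetical helper (not NLTK's API): pad seq, then slide a window of size n."""
    symbols = [pad] * left + list(seq) + [pad] * right
    return [tuple(symbols[i:i + n]) for i in range(len(symbols) - n + 1)]


word = "TEXT"
k = len(word)  # k = 4

for n in (2, 3, 4):
    # Symmetric padding: n - 1 pad symbols on each side.
    symmetric = padded_ngrams(word, n, left=n - 1, right=n - 1)
    # The blog's scheme: one leading space, n - 1 pad symbols on the right.
    blog = padded_ngrams(word, n, left=1, right=n - 1)
    print(n, len(symmetric), len(blog))
```

With symmetric padding the count grows with n (it works out to k + n - 1, so 5, 6, 7 here), whereas the one-space-left scheme stays at k + 1 = 5 for every n, which is the property the blog describes.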