
real word count in NLTK


The NLTK book has a couple of examples of word counts, but in reality they are not word counts but token counts. For instance, Chapter 1, Counting Vocabulary says that the following gives a word count:

text = nltk.Text(tokens)
len(text)

However, it doesn't - it gives a word and punctuation count. How can you get a real word count (ignoring punctuation)?

Similarly, how can you get the average number of characters in a word? The obvious answer is:

word_average_length = len(string_of_text) / len(text)

However, this would be off because:

  1. len(string_of_text) is a character count, including spaces
  2. len(text) is a token count, excluding spaces but including punctuation marks, which aren't words

Am I missing something here? This must be a very common NLP task...
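To make the discrepancy concrete, here is a minimal sketch using a hand-written token list (standing in for the output of a tokenizer such as nltk.word_tokenize, to keep it self-contained):

```python
# Tokens as a tokenizer would produce them, written out by hand here
tokens = ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']

token_count = len(tokens)  # 9: punctuation tokens are counted too
word_count = sum(1 for t in tokens if any(c.isalnum() for c in t))  # 6: words only

print(token_count, word_count)  # 9 6
```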


Solution

  • Removing Punctuation

    Use a regular expression to filter out punctuation-only tokens:

    import re
    from collections import Counter
    
    >>> text = ['this', 'is', 'a', 'sentence', '.']
    >>> nonPunct = re.compile('.*[A-Za-z0-9].*')  # must contain a letter or digit
    >>> filtered = [w for w in text if nonPunct.match(w)]
    >>> counts = Counter(filtered)
    >>> counts
    Counter({'this': 1, 'a': 1, 'is': 1, 'sentence': 1})
    
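If you'd rather avoid a regex, an equivalent filter (my own variant, not from the original answer) keeps any token containing at least one alphanumeric character:

```python
from collections import Counter

text = ['this', 'is', 'a', 'sentence', '.']
# Keep tokens with at least one letter or digit, same effect as the regex
filtered = [w for w in text if any(ch.isalnum() for ch in w)]
counts = Counter(filtered)
print(counts)  # Counter({'this': 1, 'is': 1, 'a': 1, 'sentence': 1})
```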

    Average Number of Characters

    Sum the lengths of each word. Divide by the number of words.

    >>> sum(map(len, filtered)) / len(filtered)
    3.75
    

    Or you could reuse the counts you already computed to avoid some re-computation. This multiplies each word's length by the number of times it was seen, then sums it all up.

    >>> sum(len(w) * c for w, c in counts.items()) / len(filtered)
    3.75
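
Putting the pieces together, one possible helper (the name and structure are mine, not part of NLTK) that returns both the real word count and the average word length:

```python
import re
from collections import Counter

NON_PUNCT = re.compile('.*[A-Za-z0-9].*')  # token must contain a letter or digit

def word_stats(tokens):
    """Return (word_count, average_word_length), ignoring punctuation tokens."""
    words = [t for t in tokens if NON_PUNCT.match(t)]
    if not words:
        return 0, 0.0
    counts = Counter(words)
    total_chars = sum(len(w) * c for w, c in counts.items())
    return len(words), total_chars / len(words)

print(word_stats(['this', 'is', 'a', 'sentence', '.']))  # (4, 3.75)
```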