
NLTK: Find contexts of size 2k for a word


I have a corpus and I have a word. For each occurrence of the word in the corpus I want to get a list containing the k words before and the k words after it. My algorithmic approach works (see below), but I wondered whether NLTK provides some functionality for this that I have missed?

def sized_context(word_index, window_radius, corpus):
    """Return a list of the window_radius words to the left and to the
    right of word_index, not including the word at word_index itself.
    """
    max_length = len(corpus)

    # Clamp the window to the corpus boundaries.
    left_border = max(word_index - window_radius, 0)
    right_border = min(word_index + 1 + window_radius, max_length)

    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]
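As a quick sanity check, the helper clamps the window at the corpus boundaries; here is a self-contained run (the helper is repeated so the snippet stands alone, and the sample sentence is just an illustration):

```python
def sized_context(word_index, window_radius, corpus):
    """Return the window_radius words on each side of word_index,
    excluding the word at word_index itself."""
    left_border = max(word_index - window_radius, 0)
    right_border = min(word_index + 1 + window_radius, len(corpus))
    return corpus[left_border:word_index] + corpus[word_index + 1:right_border]

corpus = "the quick brown fox jumps over the lazy dog".split()

# A word in the middle gets a full window on both sides:
print(sized_context(4, 2, corpus))  # ['brown', 'fox', 'over', 'the']

# At the start of the corpus the left side is clamped to what exists:
print(sized_context(0, 2, corpus))  # ['quick', 'brown']
```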

Solution

  • The simplest, nltk-ish way to do this is with nltk.ngrams().

    import nltk

    words = nltk.corpus.brown.words()
    k = 5
    for ngram in nltk.ngrams(words, 2*k+1, pad_left=True, pad_right=True, pad_symbol=" "):
        if ngram[k].lower() == "settle":  # in a (2k+1)-gram, the center word is at index k
            print(" ".join(ngram))
    

    pad_left and pad_right ensure that all words get looked at, including those within k words of the start or end of the corpus. This matters if you don't let your concordances span sentence boundaries (hence: lots of boundary cases).

    If you want to ignore punctuation in the window size, you can strip it before scanning:

    import re

    words = (w for w in nltk.corpus.brown.words() if re.search(r"\w", w))
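If you want the contexts as lists (as in the question) rather than printed lines, the same padded-window idea can be expressed without NLTK. This is a dependency-free sketch; `contexts_of` and `pad_symbol` are names invented here, and the padding mirrors what `pad_left`/`pad_right` do in `nltk.ngrams`:

```python
def contexts_of(target, tokens, k, pad_symbol=" "):
    """Yield, for each occurrence of target, the k tokens before and the
    k tokens after it. Padding ensures occurrences near the corpus
    boundaries are not skipped; pad symbols are stripped from the result.
    """
    padded = [pad_symbol] * k + list(tokens) + [pad_symbol] * k
    for i in range(k, len(padded) - k):
        if padded[i].lower() == target:
            # Drop any padding so the context contains only real tokens.
            window = padded[i - k:i] + padded[i + 1:i + k + 1]
            yield [w for w in window if w != pad_symbol]

tokens = "We will settle this now or we will never settle it".split()
for ctx in contexts_of("settle", tokens, 2):
    print(ctx)  # ['We', 'will', 'this', 'now'] then ['will', 'never', 'it']
```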