Tags: python, nltk, corpus, stemming, lemmatization

Python NLTK: search for occurrence of a word


I use the Brown corpus via "brown.words()", which gives me a list of 1161192 words.

Now I want to find every occurrence of the word "have", so whenever the corpus contains "has", "had", "haven't", etc., I want to do something with it (push it into an array, increment a counter, or something else).

Edit: Note that this question is about matching word forms. If I search for "have", I want a way to match it to "haven't" or "had", so .count() would not solve this problem, as it doesn't help with matching anything.

Example code I would use if stemming/lemmatization worked the way I expected:

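from nltk.corpus import brown
from nltk.stem import WordNetLemmatizer

def findWordFamily(findWord):
    wordFamily = []

    lmtzr = WordNetLemmatizer()

    # Reduce the search term to its lemma, then compare each corpus word's lemma to it
    findWord = lmtzr.lemmatize(findWord)
    for word in brown.words():
        lemma = lmtzr.lemmatize(word)
        if lemma == findWord:
            wordFamily.append(word)

    return wordFamily

print(findWordFamily("have"))
# Desired output: ["have", "have", "had", "having", "haven't", "having"]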

But the problem is that:

for word in brown.words():
    lemma = lmtzr.lemmatize(word)
    # if word is "having" lemma also is "having" instead of "have"

Solution

  • Before trying to match the words, you might want to do a little pre-processing, so that "has" or "haven't" ends up transformed to "have".

    I recommend you take a look at both stemming and lemmatizing:

    NLTK's Wordnet Lemmatizer (one of my favorites): http://www.nltk.org/_modules/nltk/stem/wordnet.html

    NLTK's stemmers: http://www.nltk.org/howto/stem.html

    Note: for the lemmatizer to work well with verbs, you have to specify that they are in fact verbs.

    nltk.stem.WordNetLemmatizer().lemmatize('having', 'v')
    

    Hope this helps!
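
To tie the verb hint back to the function in the question, here is a minimal sketch rather than a definitive implementation. The names find_word_family and brown_verb_family are hypothetical. It treats every token as a verb, which is an assumption that suits "have" but may over-match for other words. As far as I know the WordNet lemmatizer leaves contractions such as "haven't" unchanged, so the sketch also strips a trailing "n't" before lemmatizing; that step is my own addition, not something NLTK does for you.

```python
def find_word_family(target, words, lemmatize):
    """Return every token in `words` whose lemma equals the lemma of `target`.

    `lemmatize` is any callable that maps a lowercase token to its lemma.
    """
    def normalize(token):
        token = token.lower()
        # Assumption: drop a trailing "n't" so "haven't" is treated as "have".
        if token.endswith("n't"):
            token = token[:-3]
        return lemmatize(token)

    target_lemma = normalize(target)
    return [token for token in words if normalize(token) == target_lemma]


def brown_verb_family(target):
    """Run the search over the Brown corpus using WordNet verb lemmas.

    Needs nltk plus the 'brown' and 'wordnet' data packages
    (nltk.download('brown') and nltk.download('wordnet')).
    """
    # Imported here so find_word_family can be tried without NLTK installed.
    from nltk.corpus import brown
    from nltk.stem import WordNetLemmatizer

    lemmatizer = WordNetLemmatizer()
    cache = {}  # the corpus repeats words a lot, so remember each lemma

    def verb_lemma(token):
        if token not in cache:
            cache[token] = lemmatizer.lemmatize(token, pos="v")
        return cache[token]

    return find_word_family(target, brown.words(), verb_lemma)
```

Calling brown_verb_family("have") should then collect the inflected forms from the corpus. Passing the lemmatizer in as a plain callable also makes it easy to swap in a stemmer or a POS-aware lemmatizer later.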