I use the brown corpus "brown.words()" which gives me a list of 1161192 words.
Now I want to find any occurrence of the word "have", so whenever in the corpus there is an "has", "had", "haven't" ect. I want to do something (could be pushing them into an array, could be a counter, could be something else.
Edit: Note that this question is about finding a matching word. If I search "have" I want a way to match it to "haven't" or "had", thus the .count() would not solve this problem as it dosen't help matching anything.
Example code I would use in case stemming/lemmatization would work:
def findWordFamily(findWord):
wordFamily = []
lmtzr = WordNetLemmatizer()
findWord = lmtzr.lemmatize(findWord)
for word in brown.words():
lemma = lmtzr.lemmatize(word)
if lemma == findWord:
wordFamily.append(word)
return wordFamily
print(findWordFamily("have"))
# ["have", "have", "had", "having","haven't", "having"]
But the problem is that:
for word in brown.words():
lemma = lmtzr.lemmatize(word)
# if word is "having" lemma also is "having" instead of "have"
Before trying to match the words, you might want to do a little of pre-processing. So "has" or "haven't" end up "transformed" to "have".
I recommend you take a look at both stemming or lemmatizing:
NLTK's Wordnet Lemmatizer (one of my favorites): http://www.nltk.org/_modules/nltk/stem/wordnet.html
NLTK's stemmers: http://www.nltk.org/howto/stem.html
Note: for the lemmatizer to work well with verbs, you have to specify that they are in fact verbs.
nltk.stem.WordNetLemmatizer().lemmatize('having', 'v')
Hope this helps!