I'm getting to know the WordCloud package for Python, and I'm testing it with the Moby Dick text from NLTK. A snippet of the text is shown below:
[Image: snippet of the Moby Dick text with the escaped apostrophes highlighted]
As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to be including them in the frequency count as "S":
[Image: frequency distribution of words]
Of course this causes a problem: "S" is counted with a very high frequency, so the frequencies of all the other words are skewed in the cloud:
[Image: word cloud dominated by "S"]
In a tutorial I'm following that uses the same Moby Dick string, the word cloud doesn't seem to count the "S". Am I missing an attribute somewhere, or do I have to manually remove "/'s" from my string?
Below is a summary of my code:
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
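For reference, one quick way to see exactly what WordCloud is counting is its process_text method, which applies the package's own tokenization and stopword filtering. This is a minimal sketch assuming the novel_as_string built above:

from collections import Counter

# process_text returns a {word: count} dict using WordCloud's
# own tokenizer and stopword list
counts = WordCloud().process_text(novel_as_string)

# show the ten most frequent tokens; the stray "S" shows up
# near the top here if it is being counted
for word, count in Counter(counts).most_common(10):
    print(word, count)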
In applications like this, it's usual to use stopwords to filter the word list first, since you don't want simple words such as "a", "an", "the", "it", and so on to dominate your result. I changed the code a little bit; hope it helps. You can check the contents of stopwords.
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
# word_list = ["".join(word) for word in example_corpus]  # each token is already a string, so this join changes nothing
# filter the tokens against NLTK's English stopword list; building the
# set once up front is much faster than calling stopwords.words() per word
stop_words = set(stopwords.words('english'))
word_list = [word for word in example_corpus if word.lower() not in stop_words]
novel_as_string = " ".join(word_list)
wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Output: [Image: resulting word cloud]
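As for the "'s" tokens themselves, you shouldn't need to edit the string by hand. Two options, sketched under the assumption that your tokens look like the NLTK output above: filter the token list down to purely alphabetic words before joining, and/or pass extra stopwords to WordCloud itself (the stopwords parameter and the STOPWORDS set are part of the wordcloud package):

from wordcloud import WordCloud, STOPWORDS

# drop punctuation and possessive tokens such as "'s" before joining
word_list = [word for word in example_corpus if word.isalpha()]
novel_as_string = " ".join(word_list)

# add the stray "s" to WordCloud's own stopword set; the package
# compares stopwords case-insensitively, so "S" is covered too
wordcloud = WordCloud(stopwords=STOPWORDS | {"s"}).generate(novel_as_string)

Recent releases of the wordcloud package also strip a trailing "'s" during tokenization, which may be why the tutorial you're following doesn't show the problem.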