python matplotlib nltk

While using WordCloud for Python, why is the frequency of the letter "S" considered in the cloud?


I'm getting to know the WordCloud package for Python and am testing it with the Moby Dick text from NLTK. A snippet of it follows:

Snippet of my example string

As you can see from the highlights in the image, all of the possessive apostrophes have been escaped to "/'S", and WordCloud seems to be including this in the frequency count as "S":

Frequency distribution of words

Of course, this causes a problem because "S" is counted as a high-frequency word, and the frequencies of all the other words are skewed in the cloud:

Example of my skewed cloud

In a tutorial that I'm following for the same Moby Dick string, the WordCloud doesn't seem to be counting the "S". Am I missing an attribute somewhere or do I have to manually remove "/'s" from my string?

Below is a summary of my code:

import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud

example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
word_list = ["".join(word) for word in example_corpus]
novel_as_string = " ".join(word_list)

wordcloud = WordCloud().generate(novel_as_string)
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")

plt.show()

Solution

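  • In applications like this, you usually filter the word list with stopwords first, since you don't want simple words such as a, an, the, it, ... to dominate the result. NLTK's English stopword list also contains "s", so the stray possessive fragments are filtered out as well.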

    I changed the code a little; I hope it helps. You can print stopwords.words('english') to check the contents of the stopword list.

    import nltk
    from wordcloud import WordCloud
    import matplotlib.pyplot as plt
    from nltk.corpus import stopwords
    
    example_corpus = nltk.corpus.gutenberg.words("melville-moby_dick.txt")
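    # word_list = ["".join(word) for word in example_corpus]  # this statement appears to change nothing
    # build the stopword set once; calling stopwords.words() for every word is very slow
    stop_words = set(stopwords.words('english'))
    # use the stopwords to filter the words (the list is lowercase, so compare in lowercase)
    word_list = [word for word in example_corpus if word.lower() not in stop_words]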
    novel_as_string = " ".join(word_list)
    
    wordcloud = WordCloud().generate(novel_as_string)
    plt.imshow(wordcloud, interpolation="bilinear")
    plt.axis("off")
    
    plt.show()
    

    Output: see the resulting word cloud (image on Imgur).
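
To make the cause of the stray "s" easier to see, here is a minimal sketch that needs neither NLTK nor WordCloud. It assumes the corpus reader splits a possessive such as "whale's" into three tokens ("whale", "'", "s"); the `tokens` list, the small `STOP_WORDS` set, and the `clean_tokens` helper are made up for illustration and are not part of either library.

```python
from collections import Counter

# Hypothetical stand-in for nltk.corpus.gutenberg.words(...).
# Assumption: the tokenizer splits "whale's" into "whale", "'", "s".
tokens = ["The", "whale", "'", "s", "tail", "and", "Ahab", "'", "s", "ship",
          ",", "the", "whale", "."]

# Tiny illustrative stopword set (an assumption); in practice you would
# use set(stopwords.words("english")) from NLTK instead.
STOP_WORDS = {"the", "and", "a", "an", "it", "s", "t"}


def clean_tokens(tokens, stop_words):
    """Keep alphabetic tokens longer than one character that are not stopwords."""
    return [
        tok.lower()
        for tok in tokens
        if tok.isalpha() and len(tok) > 1 and tok.lower() not in stop_words
    ]


# Counting every alphabetic token lets the possessive fragment "s" through.
naive = Counter(tok.lower() for tok in tokens if tok.isalpha())

# Counting only the cleaned tokens drops "s" along with the other stopwords.
cleaned = Counter(clean_tokens(tokens, STOP_WORDS))

print(naive.most_common())
print(cleaned.most_common())
```

If you prefer to control the counts yourself, a dictionary like `cleaned` could be passed to `WordCloud().generate_from_frequencies(...)` instead of joining the words back into one string. Note that this sketch lowercases every token, which may or may not be what you want for proper nouns.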