
How to remove odd characters in R wordcloud


I'm trying to build a word cloud in R using a tm corpus and various tm_map functions. The problem is that an odd symbol keeps appearing: a euro sign followed by curly quotes (€“). It comes up as the second most frequent term in my corpus. (There are one or two others, but they're nowhere near as frequent, so they're less of a problem.)

Word cloud with rogue €“

Any ideas how to get rid of this?

This is a sample of the text in .txt format before it is pulled into R:

The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform. It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said.

This is how it comes out after being pulled into R via Corpus():

The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform.\n\nIt had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.â€\u009d\n\nZerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,â€\u009d he said.

Then I run this code:

# Convert the text to lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove english common stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
# Remove your own stop word
# specify your stopwords as a character vector
corpus <- tm_map(corpus, removeWords, c("new", "products", "way", "back",
                                        "can", "need", "also", "â", "look", "will", "one", "right",
                                        "move", "gorge", "mathieu", "like",
                                        "said", "€“", "–", "â", "data",
                                        "use", "storage"))
# Remove punctuations
corpus <- tm_map(corpus, removePunctuation)
# Eliminate extra white spaces
corpus <- tm_map(corpus, stripWhitespace)

After that the same body of text looks like this:

virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn €œbidirectional replication azure started try develop natively via apis clouds support taken longer awsâ€\u009d zerto added bidirectional replication ibm cloud van doorn company plan add support google cloud platform €œit’s something we’re keeping eye it’s wishlist rather roadmap

So those tm_map functions haven't removed all the junk, and the word cloud I build from this text still contains it.

Any ideas how to fix this?


Solution

  • If you don't mind using an extra package, the textclean package works nicely in combination with the tm functions. It contains many useful functions for cleaning text that has odd characters, URLs, emoticons, and so on. For the example text, use replace_curly_quote to convert the curly quote characters (such as ” and ’) to plain ones, and replace_contraction to expand contractions such as "it's" to "it is". See the working example below. After that, you can use the wordcloud package to create the word cloud.

    txt <- "The move to Virtual Replication 6 added replication in and out of AWS where that had only previously been one-way, into the Amazon cloud storage platform. It had taken longer to develop in AWS, said Zerto technology evangelist Gjisbert Janssen van Doorn. “Bi-directional replication to and from Azure was where we started. We try to develop natively via APIs for the clouds we support but that had taken longer with AWS.” Zerto has also added bi-directional replication with IBM Cloud. van Doorn said the company had no plan to add support for Google Cloud Platform. “It’s something we’re keeping an eye on. It’s on the wishlist rather than the roadmap,” he said."
    
    library(tm)
    library(textclean)
    
    corpus <- VCorpus(VectorSource(txt))
    corpus <- tm_map(corpus, content_transformer(tolower))
    
    # textclean function that converts curly quotes (” and ’) to straight ones;
    # content_transformer keeps each document a tm text document
    corpus <- tm_map(corpus, content_transformer(replace_curly_quote))
    # textclean function that expands contractions, e.g. "it's" to "it is"
    corpus <- tm_map(corpus, content_transformer(replace_contraction))
    
    # Remove punctuation
    corpus <- tm_map(corpus, removePunctuation)
    
    # Remove numbers
    corpus <- tm_map(corpus, removeNumbers)
    
    # Remove common English stopwords
    corpus <- tm_map(corpus, removeWords, stopwords("english"))
    
    # Remove custom stopwords
    my_stopwords <- c("new", "products", "way", "back", "can", "need", "also", 
                      "look", "will", "one", "right","move", "gorge", "mathieu", 
                      "like", "said", "data","use", "storage")
    
    corpus <- tm_map(corpus, removeWords, my_stopwords)
    
    # Remove the extra whitespace left behind by the steps above
    corpus <- tm_map(corpus, stripWhitespace)
    
    # Inspect the cleaned text of the first document
    content(corpus[[1]])
    [1] " virtual replication added replication aws previously oneway amazon cloud platform taken longer develop aws zerto technology evangelist gjisbert janssen van doorn bidirectional replication azure started try develop natively via apis clouds support taken longer aws zerto added bidirectional replication ibm cloud van doorn company plan add support google cloud platform something keeping eye wishlist rather roadmap "
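
To go from cleaned text to the word cloud itself, one option is to count term frequencies and pass them to wordcloud(). This is only a minimal sketch, assuming the tm and wordcloud packages are installed. The short `cleaned` string is a stand-in for the output above, and the `min.freq` and `random.order` settings are illustrative choices, not part of the original answer.

```r
library(tm)
library(wordcloud)

# Stand-in for the cleaned text produced above (shortened for illustration)
cleaned <- "virtual replication added replication aws amazon cloud platform taken longer develop aws"

corpus <- VCorpus(VectorSource(cleaned))

# Count how often each term occurs across the corpus
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Draw the cloud, most frequent terms first; min.freq = 1 keeps every term
wordcloud(words = names(freq), freq = freq, min.freq = 1, random.order = FALSE)
```

If a stray symbol still shows up in the cloud, printing `head(freq)` is a quick way to see exactly which term is responsible.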
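
It may also be worth checking how the file is read. The `â€\u009d` sequences in the question are what the UTF-8 bytes of a curly closing quote look like when they are decoded with a single-byte Windows/Latin encoding. The following base-R sketch assumes the .txt file is UTF-8 encoded; the temporary file and its one-line content are made up for illustration.

```r
# Write a small UTF-8 file to stand in for the real .txt file (hypothetical content)
path <- tempfile(fileext = ".txt")
sample_text <- "\u201cIt\u2019s on the wishlist rather than the roadmap,\u201d he said."
writeBin(charToRaw(enc2utf8(sample_text)), path)

# Declare the encoding when reading so the curly quotes stay intact
txt <- readLines(path, encoding = "UTF-8", warn = FALSE)

# The curly quotes are now single characters that a cleaning step can match
has_curly <- grepl("\u201d", txt, fixed = TRUE)

# Convert curly single and double quotes to their plain equivalents
plain <- gsub("[\u201c\u201d]", "\"", gsub("[\u2018\u2019]", "'", txt))
print(plain)
```

When the corpus is built from files with tm, DirSource() takes an `encoding` argument that serves the same purpose.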