I am trying to learn how to do some text analysis with Twitter data. I am running into an issue when creating a term-frequency matrix. I create the corpus out of Spanish text (with special characters) with no issues.
However, when I create the term-frequency matrix (with either the quanteda or the tm library), the Spanish characters do not display as expected (instead of seeing canción, I see canciÃ³n).
Any suggestions on how I can get the term-frequency matrix to store the text with the correct characters?
Thank you for any help.
As a note: I prefer using the quanteda library, since ultimately I will be creating a word cloud and I think I better understand this library's approach. I am also on a Windows machine.
I have tried Encoding(tw2) <- "UTF-8" with no luck.
library(dplyr)
library(tm)
library(quanteda)
#' Create a character vector containing special Spanish characters:
tw2 <- "RT @None: Enmascarados, si masduro chingán a tarek. Si quieres ahora, la aguantas canción . https://t."
# Clean the tweet: remove special punctuation, numbers, http links, and extra spaces:
clean_tw2 <- tolower(tw2)
clean_tw2 <- gsub("&", "", clean_tw2)
clean_tw2 <- gsub("(rt|via)((?:\\b\\W*@\\w+)+)", "", clean_tw2)
clean_tw2 <- gsub("@\\w+", "", clean_tw2)
clean_tw2 <- gsub("[[:punct:]]", "", clean_tw2)
clean_tw2 <- gsub("http\\w+", "", clean_tw2)
clean_tw2 <- gsub("[ \t]{2,}", " ", clean_tw2)  # collapse runs of whitespace to a single space
clean_tw2 <- gsub("^\\s+|\\s+$", "", clean_tw2)
# Create a vector of common Spanish stopwords, plus other words I want removed.
myStopwords <- c(stopwords("spanish"),"tarek","vez","ser","ahora")
clean_tw2 <- (removeWords(clean_tw2,myStopwords))
# If we print clean_tw2 we see that all the characters are displayed as expected.
clean_tw2
#' Create corpus using the quanteda library
corp_quan<-corpus(clean_tw2)
# The corpus created via quanteda, displays the characters as expected.
corp_quan$documents$texts
#' Create corpus using the tm library
corp_td<-Corpus(VectorSource(clean_tw2))
#' If we inspect corp_td, we see that the characters and words are displayed correctly.
inspect(corp_td)
# Create the DFM with the quanteda library.
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción displays as canciÃ³n.
tdm_quan
# Create the TDM with the tm library.
tdm_td<-TermDocumentMatrix(corp_td)
# Here we see that the Spanish characters are displayed incorrectly (e.g. canción displays as canciÃ³n), and "si" is missing.
tdm_td$dimnames$Terms
It looks like quanteda (and tm) loses the encoding when creating the DFM on the Windows platform. In this tidytext question the same problem happened with unnesting tokens, which works fine now; quanteda's tokens() works fine as well.
If you enforce UTF-8 (or latin1) encoding on the @Dimnames$features slot of the dfm, you get the correct results.
....
previous code
.....
tdm_quan<-dfm(corp_quan)
# Here we see that the Spanish characters are displayed incorrectly, for example: canción displays as canciÃ³n.
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingÃ¡n quieres aguantas canciÃ³n t
text1 1 2 1 1 1 1 1 1
If you do the following:
Encoding(tdm_quan@Dimnames$features) <- "UTF-8"
tdm_quan
Document-feature matrix of: 1 document, 8 features (0% sparse).
1 x 8 sparse Matrix of class "dfm"
features
docs enmascarados si masduro chingán quieres aguantas canción t
text1 1 2 1 1 1 1 1 1
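To avoid repeating the fix every time, you could wrap it in a small helper. This is my own sketch (dfm_utf8 and tdm_utf8 are hypothetical names, not part of quanteda or tm); it simply re-marks the term labels as UTF-8 after the matrix is built, which is all the Encoding() assignment above does:

```r
library(quanteda)
library(tm)

# Hypothetical helper: build a quanteda dfm, then re-mark its feature
# names as UTF-8 to work around the lost encoding on Windows.
# (Depending on your quanteda version, you may need dfm(tokens(corp)).)
dfm_utf8 <- function(corp, ...) {
  d <- dfm(corp, ...)
  Encoding(d@Dimnames$features) <- "UTF-8"
  d
}

# The same idea for tm: re-mark the term labels of a TermDocumentMatrix.
tdm_utf8 <- function(corp, ...) {
  m <- TermDocumentMatrix(corp, ...)
  Encoding(m$dimnames$Terms) <- "UTF-8"
  m
}
```

Note that Encoding()<- only changes how R interprets the existing bytes; it does not convert them. That is exactly what is needed here, since the bytes are already valid UTF-8 and only the declared encoding was dropped.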