I am using R in RStudio and I am running the following code to perform a sentiment analysis on a set of unstructured texts.
Because the texts contain some invalid characters (caused by emoticons and typing errors), I want to remove them before proceeding with the analysis.
An extract of my R code follows:
setwd("E:/sentiment")
doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)
# replace specific characters in doc1
doc1<-gsub("[^\x01-\x7F]", "", doc1)
library(tm)
#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))
I get the following error message when I reach the line corpus<- iconv(doc1$Review.Text, to = 'utf-8'):
Error in doc1$Review.Text : $ operator is invalid for atomic vectors
I had a look at the following StackOverflow questions:
remove emoticons in R using tm package
Replace specific characters within strings
I also tried the following to clean my texts before using the tm package, but I get the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")
How can I solve this issue?
With
doc1<-gsub("[^\x01-\x7F]", "", doc1)
you overwrite the object doc1; from then on it is no longer a data frame but a character vector. See:
doc1 <- gsub("[^\x01-\x7F]", "", iris)
str(doc1)
It is now clear that
doc1$Species
produces the error.
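To see the coercion without needing the CSV file, here is a minimal sketch on a made-up data frame. The names df, flat and msg are hypothetical, chosen only for illustration; the point is that gsub() converts whatever it is given to a character vector, so the columns are gone afterwards.

```r
# Hypothetical toy data frame standing in for the contents of book1.csv
df <- data.frame(Review.Text = c("good", "bad"), stringsAsFactors = FALSE)
is.data.frame(df)      # a data frame before gsub()

# gsub() coerces its input with as.character(), so the result is a plain character vector
flat <- gsub("[^\x01-\x7F]", "", df)
is.data.frame(flat)    # no longer a data frame
is.character(flat)     # a character vector instead

# Using $ on that vector raises the error from the question; capture its message
msg <- tryCatch(flat$Review.Text, error = function(e) conditionMessage(e))
print(msg)
```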
What you probably want to do is:
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
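As a rough, self-contained check of that line, the sketch below applies it to a small made-up data frame instead of book1.csv. The Id column and the two review strings are assumptions for illustration only; the Review.Text column name comes from the question.

```r
# Hypothetical stand-in for read.csv("book1.csv", ...); the review texts are made up
doc1 <- data.frame(
  Id = 1:2,
  Review.Text = c("Great book \U0001F600!", "caf\u00e9 scene was dull"),
  stringsAsFactors = FALSE
)

# Clean only the text column, so doc1 itself stays a data frame
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)

is.data.frame(doc1)    # still a data frame, so doc1$Review.Text keeps working
print(doc1$Review.Text)
```

Because doc1 is still a data frame at this point, the later lines from the question, iconv(doc1$Review.Text, to = 'utf-8') and Corpus(VectorSource(corpus)), should no longer hit the $ error.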