How can I solve this R error message relating to atomic vectors?


I am using R in RStudio and running the following code to perform a sentiment analysis on a set of unstructured texts. Since the texts contain some invalid characters (caused by emoticons and typos), I want to remove them before proceeding with the analysis.

An extract of my R code:

setwd("E:/sentiment")

doc1=read.csv("book1.csv", stringsAsFactors = FALSE, header = TRUE)

# replace specific characters in doc1
  doc1<-gsub("[^\x01-\x7F]", "", doc1)

library(tm)

#Build Corpus
corpus<- iconv(doc1$Review.Text, to = 'utf-8')
corpus<- Corpus(VectorSource(corpus))

I get the following error message when I reach the line corpus<- iconv(doc1$Review.Text, to = 'utf-8'):

Error in doc1$Review.Text : $ operator is invalid for atomic vectors

I had a look at the following StackOverflow questions:

remove emoticons in R using tm package

Replace specific characters within strings

I have also tried the following to clean my texts before running the tm package, but I am getting the same error: doc1<-iconv(doc1, "latin1", "ASCII", sub="")

How can I solve this issue?


Solution

  • With 

    doc1<-gsub("[^\x01-\x7F]", "", doc1)
    

     you overwrite the object doc1. From that point on it is no longer a data frame but a character vector, because gsub() coerces its input to character. See:

    doc1 <- gsub("[^\x01-\x7F]", "", iris)
    str(doc1)
    

    and now, clearly,

    doc1$Species
    

    produces the error, since $ cannot be used on an atomic vector.
    What you probably want instead is to clean only the text column:

    doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)
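To check the fix without the original file, here is a minimal sketch that uses a small made-up data frame in place of book1.csv. The sample rows and the Rating column are assumptions for illustration only; the column name Review.Text is taken from the question. It shows that cleaning only the column keeps doc1 a data frame, so the later doc1$Review.Text call works.

```r
# Hypothetical stand-in for read.csv("book1.csv", ...); the rows are invented.
doc1 <- data.frame(
  Review.Text = c("Great book \U0001F600", "caf\u00e9 scene was dull", "plain ascii"),
  Rating = c(5, 2, 3),
  stringsAsFactors = FALSE
)

# Clean only the text column; doc1 itself is not overwritten.
doc1$Review.Text <- gsub("[^\x01-\x7F]", "", doc1$Review.Text)

# doc1 is still a data frame, so $ keeps working.
print(is.data.frame(doc1))
print(doc1$Review.Text)

# For contrast: passing the whole data frame to gsub() returns a character
# vector, which is what triggered the error in the question.
flattened <- gsub("[^\x01-\x7F]", "", doc1)
print(is.data.frame(flattened))
```

The same idea should apply to the iconv attempt from the question: pass the column rather than the whole object, for example doc1$Review.Text <- iconv(doc1$Review.Text, "latin1", "ASCII", sub = ""), provided "latin1" really is the encoding of the input.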