I am using R for a sentiment analysis. My source file, which contains around 50 guest reviews, was created in Excel, with each review in its own row of a single column. All reviews are therefore in column A, with no header. The file was then saved as a CSV file and stored in a folder.
My R code is as follows:
library(tm)
docs<-Corpus(DirSource('E:/Sentiment Analysis'))
#checking a particular review in the document
writeLines(as.character(docs[[20]]))
Running that last line gives me an out-of-bounds error.
When I change it to writeLines(as.character(docs[[1]])), R displays all the reviews as if they were one whole paragraph.
How can I correct this issue?
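The tm::Corpus() function, when used with DirSource(), treats each file as a separate document, rather than each line within a file. The folder in the question holds a single CSV file, so the corpus contains exactly one document: docs[[1]] returns every review at once, and docs[[20]] is out of bounds.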
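To see the one-document-per-file behaviour in isolation, here is a minimal sketch. The scratch directory and the file names fileA.csv and fileB.csv are made up for illustration and are not part of the question's setup.

```r
library(tm)

# Hypothetical scratch directory with two small files (names are illustrative)
demoDir <- tempfile("dirSourceDemo")
dir.create(demoDir)
writeLines(c("Review one.", "Review two.", "Review three."),
           file.path(demoDir, "fileA.csv"))
writeLines(c("Review four.", "Review five."),
           file.path(demoDir, "fileB.csv"))

# DirSource() yields one document per file, however many lines each file has
dirCorpus <- Corpus(DirSource(demoDir))
length(dirCorpus)   # 2: five lines of text, but only two files
```

With a single file in the folder, as in the question, the same call would produce a corpus of length 1.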
To read each row of a text file as a separate document, one can use the Corpus(VectorSource()) syntax.
As an example, we'll create a text file and read it from a directory to illustrate how Corpus() behaves with DirSource(), versus how we would read it with VectorSource().
# aTextFile represents the contents of the text file stored at
# ./data/textMining/ExcelFile1.csv
aTextFile <- "This is line one of text.
This is line two of text. This is a second sentence in line two."
library(tm)
# read as the OP read it
corpusDir <- "./data/textMining"
aCorpus <- Corpus(DirSource(corpusDir))
length(aCorpus) # shows only one item in list, entire file
# use a pipe as the separator because the documents include commas
# (this assumes that no document contains a pipe character)
aDataFrame <- read.table("./data/textMining/ExcelFile1.csv", header = FALSE,
                         sep = "|", stringsAsFactors = FALSE)
# use VectorSource to treat each row as a separate document
aCorpus <- Corpus(VectorSource(aDataFrame$V1))
# print the two documents
aCorpus[1]$content
aCorpus[2]$content
...and the output. First, the length of the corpus as we read it with DirSource():
> length(aCorpus) # shows only one item in list, entire file
[1] 1
Second, we'll print the two rows from the second read, illustrating that they are treated as separate documents.
> aCorpus <- Corpus(VectorSource(aDataFrame$V1))
> aCorpus[1]$content
[1] "This is line one of text."
> aCorpus[2]$content
[1] "This is line two of text. This is a second sentence in line two. "
>
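Applied to the situation in the question, the same idea might look like the sketch below. It is only an illustration under stated assumptions: the temporary file stands in for the CSV exported from Excel, the three reviews are invented, and read.csv() is swapped in for read.table() on the assumption that Excel wraps any cell containing a comma in double quotes.

```r
library(tm)

# Hypothetical stand-in for the exported file: one review per row, no header.
# The first line is quoted the way Excel is assumed to quote cells with commas.
reviewFile <- tempfile(fileext = ".csv")
writeLines(c('"Great location, friendly staff."',
             "The room wasn't clean.",
             "Breakfast was average."),
           reviewFile)

# read.csv() treats only double quotes as quote characters, so the comma in
# the first review and the apostrophe in the second do not split the rows
reviewData <- read.csv(reviewFile, header = FALSE, stringsAsFactors = FALSE)

# One element of the vector becomes one document in the corpus
docs <- Corpus(VectorSource(reviewData$V1))
length(docs)

# Checking a particular review now works as intended
writeLines(as.character(docs[[2]]))
```

With the real file, reviewFile would be replaced by the path to the CSV inside the folder, and docs[[20]] would then refer to the twentieth review rather than falling out of bounds.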