
Why is R merging all the rows in my CSV file as one whole document?


I am using R for sentiment analysis. My source file, which contains around 50 guest reviews, was created in Excel, with each review in its own row of a single column. All reviews are therefore in column A, with no header. The file was then saved as a CSV file and stored in a folder.

My R code is as follows:

library (tm)
docs<-Corpus(DirSource('E:/Sentiment Analysis'))
#checking a particular review in the document
writeLines(as.character(docs[[20]]))

Running that last line gives me an out-of-bounds error. When I change it to writeLines(as.character(docs[[1]])), R displays all the reviews as if they were one long paragraph.

How can I correct this issue?


Solution

  • When used with DirSource(), tm::Corpus() treats each file in the directory as one document; it does not split a file into one document per line. That is why docs[[1]] prints every review together and docs[[20]] is out of bounds: the corpus has one document per file in the folder, not one per review.

    To treat each row of the file as a separate document, read the rows into a character vector and pass that vector to Corpus(VectorSource()).

    As an example, we'll take a small text file and read it from a directory to show how Corpus() behaves with DirSource(), then read the same file row by row and load it with VectorSource().

    # contents of the text file stored in ./data/textMining/ExcelFile1.csv
    # (shown for reference only; this variable is not used below)
    aTextFile <- "This is line one of text.
    This is line two of text. This is a second sentence in line two."
    
    library(tm)
    # read as the OP read it
    corpusDir <- "./data/textMining"
    aCorpus <- Corpus(DirSource(corpusDir))
    length(aCorpus) # shows only one item in list, entire file
    
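    # use a pipe as the separator so that commas inside a document do not split it
    # into several columns (this assumes the documents contain no pipe characters)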
    aDataFrame <- read.table("./data/textMining/ExcelFile1.csv",header=FALSE,
                             sep="|",stringsAsFactors=FALSE)
    # use VectorSource to treat each row as a separate document
    aCorpus <- Corpus(VectorSource(aDataFrame$V1))
    # print the two documents 
    aCorpus[1]$content
    aCorpus[2]$content 
    

    ...and the output. First, the length of the corpus as we read it with DirSource():

    > length(aCorpus) # shows only one item in list, entire file
    [1] 1
    

    Second, we'll print the two rows from the second read, illustrating that they are treated as separate documents.

    > aCorpus <- Corpus(VectorSource(aDataFrame$V1))
    > aCorpus[1]$content
    [1] "This is line one of text."
    > aCorpus[2]$content
    [1] "This is line two of text. This is a second sentence in line two. "
    >
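
For a file exported from Excel, as in the question, another option is to read it with read.csv() and header = FALSE, since Excel wraps any cell that contains a comma in double quotes. The following is only a minimal sketch under assumptions: the three sample reviews, the temporary file, and the names csvPath, reviewsDf and reviews are made up for illustration, and the tm step runs only if the package is installed.

```r
# Hypothetical stand-in for the Excel export: one review per row,
# a single column, no header. Cells containing commas are quoted.
csvPath <- tempfile(fileext = ".csv")
writeLines(c('"Great stay, friendly staff."',
             'Room was small but clean.',
             '"Breakfast was cold, and the coffee was weak."'),
           csvPath)

# header = FALSE keeps the first review from being consumed as a column name
reviewsDf <- read.csv(csvPath, header = FALSE, stringsAsFactors = FALSE)
reviews <- reviewsDf$V1
length(reviews)  # 3, one element per review

# Build the corpus only when tm is available
if (requireNamespace("tm", quietly = TRUE)) {
  docs <- tm::Corpus(tm::VectorSource(reviews))
  writeLines(as.character(docs[[2]]))  # the second review on its own
}
```

With the real file, csvPath would point at the CSV inside the folder from the question, and docs[[20]] should then return the twentieth review, provided the file has at least 20 rows.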