r, text-mining, tm, term-document-matrix

My DocumentTermMatrix reduces to Zero columns


train <- read.delim('train.tsv', header = TRUE, fileEncoding = "windows-1252", stringsAsFactors = FALSE)

train.tsv contains 156,060 lines of text in four columns: PhraseId, SentenceId, Phrase and Sentiment (on a scale of 0 to 4). The Phrase column holds the text. The tm package is already loaded. R version: 3.1.2; OS: Windows 7, 64-bit, 4 GB RAM.

> dput(head(train,6)) 
structure(list(PhraseId = 1:6, SentenceId = c(1L, 1L, 1L, 1L, 
1L, 1L), Phrase = c("A series of escapades demonstrating the adage that what is good for the goose is also good for the gander , some of which occasionally amuses but none of which amounts to much of a story .", 
"A series of escapades demonstrating the adage that what is good for the goose", 
"A series", "A", "series", "of escapades demonstrating the adage that what is good for the goose"
), Sentiment = c(1L, 2L, 2L, 2L, 2L, 2L)), .Names = c("PhraseId", 
"SentenceId", "Phrase", "Sentiment"), row.names = c(NA, 6L), class = "data.frame")

These are the first 6 rows of train.

clean_corpus <- function(corpus) {
  mycorpus <- tm_map(corpus, removeWords, stopwords("english"))
  mycorpus <- tm_map(mycorpus, removeWords, c("movie", "actor", "actress"))
  mycorpus <- tm_map(mycorpus, stripWhitespace)
  mycorpus <- tm_map(mycorpus, tolower)
  mycorpus <- tm_map(mycorpus, removeNumbers)
  mycorpus <- tm_map(mycorpus, removePunctuation)
  mycorpus <- tm_map(mycorpus, PlainTextDocument)
  return(mycorpus)
}

# Build DTM
generateDTM <- function(df) {
  m <- list(Sentiment = "Sentiment", Phrase = "Phrase")
  myReader <- readTabular(mapping = m)
  mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))

  # Attach the sentiment label to every text line
  for (i in 1:length(mycorpus)) {
    attr(mycorpus[[i]], "Sentiment") <- df$Sentiment[i]
  }
  mycorpus <- clean_corpus(mycorpus)
  dtm <- DocumentTermMatrix(mycorpus)
  return(dtm)
}

dtm1 <- generateDTM(train) 

Here I have made two functions: one to clean the corpus and another to build the DTM (document-term matrix). I have also attached each sentiment value to its line of text. When I check the dimensions of dtm1, it shows 156060 rows but 0 columns.

So, how can I generate a DTM with sentiment labels attached?


Solution

  • When you set up your reader, you need to map a column to the "content" of the document; otherwise the reader doesn't know what text to use to build the corpus. Other values are stored as metadata. Try changing the code to

    m <- list(Sentiment = "Sentiment", content = "Phrase")
    myReader <- readTabular(mapping = m)
    mycorpus <- Corpus(DataframeSource(df), readerControl = list(reader = myReader))
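
If you are on a newer tm release than the one in the question, the same idea may look a little different. As far as I know, from tm 0.7-2 onward readTabular is no longer available, and DataframeSource instead expects the first two columns to be named doc_id and text, with any remaining columns kept as document-level metadata. The following is only a minimal sketch under that assumption; the toy data frame and the names docs and corpus are made up for illustration and stand in for the real train.tsv.

```r
library(tm)

# Toy stand-in for train.tsv (a few shortened rows from the question).
train <- data.frame(
  PhraseId  = 1:3,
  Phrase    = c("A series of escapades demonstrating the adage",
                "A series",
                "good for the goose"),
  Sentiment = c(1L, 2L, 2L),
  stringsAsFactors = FALSE
)

# Assumption: a tm version where DataframeSource requires the first two
# columns to be doc_id and text; extra columns become metadata.
docs <- data.frame(
  doc_id    = as.character(train$PhraseId),
  text      = train$Phrase,
  Sentiment = train$Sentiment,
  stringsAsFactors = FALSE
)

corpus <- VCorpus(DataframeSource(docs))
dtm <- DocumentTermMatrix(corpus)

dim(dtm)                # one row per phrase, and a non-zero number of columns
meta(corpus)$Sentiment  # the sentiment labels, in document order
```

Either way the point is the same as above: the text column has to be identified as the document content, and the sentiment labels ride along as metadata rather than as content.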