stemDocument R text mining


My data is a txt file and looks as follows:
words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2

The document numbers are sorted in ascending order. For each document, I want all the words that belong to it. Right now the words are in a column, but I want all the words of a document in a TextDocument (from the package tm, because some functions in that package require it). I did this as follows:

 data <- read.table("poging.txt", header = TRUE)
 data

 doc <- c()
 #I paste all the words from a document together:
 doc[1] <- paste(data[1:4,1], collapse = ' ')
 #Rows 5 to 7 hold the words of document 2
 doc[2] <- paste(data[5:7,1], collapse = ' ')

 #Make a data.frame of it
 doc_df <- data.frame(docs = doc, row.names = 1:2)

 #Install package
 install.packages("tm")
 library(tm)

 #Make a DataframeSource of it so that each row is seen as a document
 #(note: recent tm versions expect the columns doc_id and text here)
 ds <- DataframeSource(doc_df)
 inspect(VCorpus(ds))

 #Now I want to stem for example document number 1
 stemDocument(ds[[1]])

But using ds[[1]] as the argument doesn't work: it can't find document number 1. Can someone help me?

In the examples of the package tm they use the crude data set. I want my data to be in the same format as crude.

Silke


Solution

  • stemDocument() is meant to be used with a TextDocument, not a DataframeSource. Use the source to create a corpus, then extract the documents from the corpus.

    ds <- DataframeSource(doc_df)
    corpus <- VCorpus(ds)
    stemDocument(corpus[[1]])
    

    Note that stemDocument() returns a new document and does not modify the corpus. If you want to use the output, assign it to a variable or back into the corpus.
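
Putting the pieces together, here is a minimal end-to-end sketch rather than a definitive implementation. It makes a few assumptions that are not in the question: the sample data is read inline instead of from poging.txt, the words are grouped with tapply() instead of hard-coded row ranges, and the corpus is built with VectorSource() rather than DataframeSource(). It also assumes the SnowballC package is installed, since stemDocument() relies on it. The names docs and stemmed are only illustrative.

```r
library(tm)  # stemDocument() additionally needs the SnowballC package installed

# Inline copy of the sample data from the question (stands in for poging.txt)
data <- read.table(text = "words number_doc
overwiew 1
client 1
store 1
marge 1
price 2
stock 2
economics 2", header = TRUE, stringsAsFactors = FALSE)

# One string per document: paste together the words sharing a number_doc
docs <- tapply(data$words, data$number_doc, paste, collapse = " ")

# VectorSource() treats each element of a character vector as one document
corpus <- VCorpus(VectorSource(as.character(docs)))

# stemDocument() returns a new document, so assign the result back ...
corpus[[1]] <- stemDocument(corpus[[1]])

# ... or stem every document at once and keep the returned corpus
stemmed <- tm_map(corpus, stemDocument)

# Look at the text of the second stemmed document
content(stemmed[[2]])
```

Grouping with tapply() avoids having to know in advance which rows belong to which document, so the same code should work when the file contains more than two documents.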