Tags: r, dataframe, tm, corpus

Finding which documents were deleted from a corpus in R


I want to preprocess my texts before analyzing them.

mydat

   Production of banners 1,2x2, Cutting
Production of a plate with the size 2330 * 600mm
Delivery
















Placement of advertising information on posters 0.85 * 0.65 at Ordzhonikidze Street (TSUM) -Gerzen, side A2 April 2014
Manufacturing of a banner 3,7х2,7
Placement of advertising information on the prismatron 3 * 4 at 60, Ordzhonikidze, Aldjonikidze Street, A (01.12.2011-14.12.2011)
Placement of advertising information on the multipanel 3 * 12 at Malygina-M.Torez street, side A, (01.12.2011-14.12.2011)
Designer services
41526326

12
Mounting and rolling of the RIM on the prismatron 3 * 6

The code:

  library(tm)  # stemDocument() also needs the SnowballC package installed

  mydat <- read.csv("C:/kr_csv.csv", sep = ";", dec = ",")

  tw.corpus <- Corpus(VectorSource(mydat$descr))
  tw.corpus <- tm_map(tw.corpus, removePunctuation)
  tw.corpus <- tm_map(tw.corpus, removeNumbers)
  tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
  tw.corpus <- tm_map(tw.corpus, stemDocument)

  # deleting empty documents
  doc.m <- DocumentTermMatrix(tw.corpus)

  rowTotals <- apply(doc.m, 1, sum)  # total number of terms in each document
  doc.m.new <- doc.m[rowTotals > 0, ]

1. How do I find out which observations were deleted during preprocessing (for example, that the first and second texts were deleted)?
2. How do I delete those observations from the original dataset (mydat)?


Solution

  • After pre-processing and stemming your corpus, you count the number of words left in each document. Naturally, "documents" with no words in them have a count of zero. Documents that contained only numbers and punctuation are empty as well, because removeNumbers and removePunctuation stripped those strings.

    In your data, you have many "documents" that are empty lines. In total, you have 28 "documents" in your corpus, but more than half of them are empty lines (i.e. they contain zero words).

    You calculate the word count for each document in rowTotals. If you check which entries in rowTotals are equal to zero, you get the numbers of the documents that are dropped when doc.m is subset into doc.m.new:

    rowTotals
    # 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 
    # 3  5  1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10  2  8  8  2  0  0  0  7 
    

    You can see that documents 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, etc. all contain zero words and are therefore not present in doc.m.new. You can get these numbers automatically with which():

    unname(which(rowTotals == 0))  # unname() drops the document names so only the positions print
    # [1] 4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 25 26 27
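
For the second question, the same zero counts can be used to drop the matching rows from mydat. The following is only a minimal sketch: it assumes the corpus was built from mydat$descr in row order, so that document i corresponds to row i of mydat. To keep it runnable without the original CSV, rowTotals is copied from the output shown above and mydat is a toy 28-row stand-in; the id column and the names empty.idx, keep, and mydat.new are made up for illustration.

```r
# Word counts per document, copied from the rowTotals output above
rowTotals <- c(3, 5, 1, rep(0, 16), 10, 2, 8, 8, 2, 0, 0, 0, 7)
names(rowTotals) <- seq_along(rowTotals)

# Toy stand-in for the original data frame (assumption: one row per document)
mydat <- data.frame(id = 1:28,
                    descr = paste("text", 1:28),
                    stringsAsFactors = FALSE)

# Question 1: positions of the documents that ended up empty
empty.idx <- unname(which(rowTotals == 0))

# Question 2: keep only the rows whose documents still contain words.
# A logical mask is used instead of mydat[-empty.idx, ] because negative
# indexing with an empty index vector would drop every row.
keep <- rowTotals > 0
mydat.new <- mydat[keep, , drop = FALSE]

nrow(mydat.new)
# [1] 9
```

With the real data, rowTotals would come from the DocumentTermMatrix as in the question, and mydat.new would then line up row for row with doc.m.new.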