I want to preprocess my texts before analyzing them.
mydat
Production of banners 1,2x2, Cutting
Production of a plate with the size 2330 * 600mm
Delivery
Placement of advertising information on posters 0.85 * 0.65 at Ordzhonikidze Street (TSUM) -Gerzen, side A2 April 2014
Manufacturing of a banner 3,7х2,7
Placement of advertising information on the prismatron 3 * 4 at 60, Ordzhonikidze, Aldjonikidze Street, A (01.12.2011-14.12.2011)
Placement of advertising information on the multipanel 3 * 12 at Malygina-M.Torez street, side A, (01.12.2011-14.12.2011)
Designer services
41526326
12
Mounting and rolling of the RIM on the prismatron 3 * 6
The code:
library(tm)  # Corpus, tm_map, DocumentTermMatrix (stemDocument also needs the SnowballC package installed)

mydat <- read.csv("C:/kr_csv.csv", sep = ";", dec = ",")
tw.corpus <- Corpus(VectorSource(mydat$descr))
tw.corpus <- tm_map(tw.corpus, removePunctuation)
tw.corpus <- tm_map(tw.corpus, removeNumbers)
tw.corpus <- tm_map(tw.corpus, content_transformer(tolower))
tw.corpus <- tm_map(tw.corpus, stemDocument)
# deleting empty documents
doc.m <- DocumentTermMatrix(tw.corpus)
rowTotals <- apply(doc.m, 1, sum)  # find the sum of words in each document
doc.m.new <- doc.m[rowTotals > 0, ]
1. How do I find out which observations were deleted during preprocessing (for example, that the first and second texts were deleted)?
2. How do I delete these observations from the original dataset (mydat)?
After pre-processing and stemming your corpus, you count the number of words that are left in each document. The "documents" with no words in them naturally have a count of zero. Documents that contained only numbers and punctuation are empty too, because you removed those characters.
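To see why such documents end up empty, here is a minimal base-R sketch. The two gsub() calls are only stand-ins that approximate what removePunctuation and removeNumbers do, and the vector x is a hypothetical sample built from a few lines of your data:

```r
# Hypothetical sample: one text entry, two number-only entries, one size entry
x <- c("Delivery", "41526326", "12", "3 * 6")

# Approximate removePunctuation and removeNumbers with base-R regular expressions
cleaned <- gsub("[[:punct:]]+", "", x)
cleaned <- gsub("[[:digit:]]+", "", cleaned)

# TRUE where nothing but whitespace is left after cleaning
is.empty <- nchar(trimws(cleaned)) == 0
is.empty
# [1] FALSE  TRUE  TRUE  TRUE
```

Only "Delivery" survives; the entries made of digits and punctuation are reduced to empty strings.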
In your data, you have many "documents" that are empty lines. In total, you have 28 "documents" in your corpus, but more than half of them are empty lines (i.e. they contain zero words).
You calculate the word count for each document in rowTotals. If you check which of the entries in rowTotals are equal to zero, you get the document numbers that are subsequently removed from doc.m:
rowTotals
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
# 3 5 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10 2 8 8 2 0 0 0 7
You can see that documents 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, etc. all contain zero words, and are therefore not present in doc.m. You can get these numbers automatically with which():
which( rowTotals == 0)
# [1] 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 25 26 27
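For your second question, the same condition can be used to subset the original data. The sketch below assumes that the rows of mydat line up one-to-one with the documents in the corpus, which should hold because the corpus was built from mydat$descr. To keep it self-contained, rowTotals is typed in from the output above and mydat is a hypothetical stand-in with 28 rows; with your real objects only the last three statements are needed. The names empty.rows and mydat.new are made up for illustration:

```r
# Word counts per document, copied from the rowTotals output above
rowTotals <- c(3, 5, 1, rep(0, 16), 10, 2, 8, 8, 2, 0, 0, 0, 7)
names(rowTotals) <- seq_along(rowTotals)

# Hypothetical stand-in for the real data: one row per document
mydat <- data.frame(descr = paste("text", 1:28), stringsAsFactors = FALSE)

# Row numbers of the observations that were dropped from doc.m
empty.rows <- which(rowTotals == 0)

# Keep only the rows whose document still contains at least one word
mydat.new <- mydat[rowTotals > 0, , drop = FALSE]
nrow(mydat.new)
# [1] 9
```

Logical indexing with rowTotals > 0 is used instead of mydat[-empty.rows, ] because negative indexing returns zero rows when no document is empty.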