
R - Text Analysis - Misleading results


I am doing some text analysis of mortgage-related comments from bank customers, and I have found a couple of things I do not understand.

1) After cleaning the data (without stemming) and checking the dimensions of the TDM, the number of terms (2173) is smaller than the number of documents (2373). This is before removing stop words, with a 1-gram TDM.

2) I also wanted to check the bigram frequencies (rowSums(Matrix)) after tokenizing the TDM into bigrams. The most frequent bigram I got was "Proble miss". Since this pair looked strange, I searched the dataset (Ctrl+F) for it and could not find it. Questions: it seems that the code has somehow stemmed these words; how is that possible? (Of the top 25 bigrams, this is the only one that appears to be stemmed.) Isn't the tokenizer supposed to create bigrams ONLY from words that appear next to each other?

{file_cleaning <-  replace_number(files$VERBATIM)
file_cleaning <-  replace_abbreviation(file_cleaning)
file_cleaning <-  replace_contraction(file_cleaning)
file_cleaning <- tolower(file_cleaning)
file_cleaning <- removePunctuation(file_cleaning)
file_cleaning[467]
file_cleaned <- stripWhitespace(file_cleaning)

custom_stops <- c("Bank")
file_cleaning_stops <- c(custom_stops, stopwords("en"))
file_cleaned_stopped<- removeWords(file_cleaning,file_cleaning_stops)

file_cleaned_corups<- VCorpus(VectorSource(file_cleaned))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
dim(file_cleaned_tdm) # Number of terms <number of documents
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

file_cleaned_corups<- VCorpus(VectorSource(file_cleaned_stopped))
file_cleaned_tdm <-TermDocumentMatrix(file_cleaned_corups)
file_cleaned_mx <- as.matrix(file_cleaned_tdm)

dim(file_cleaned_mx)
file_cleaned_mx[220:225, 475:478]

coffee_m <- as.matrix(coffee_tdm)

term_frequency <- rowSums(file_cleaned_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:10]


BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bigram_dtm <- TermDocumentMatrix(file_cleaned_corups, control = list(tokenize = BigramTokenizer))
dim(bigram_dtm)

bigram_bi_mx <- as.matrix(bigram_dtm)
term_frequency <- rowSums(bigram_bi_mx)
term_frequency <- sort(term_frequency, decreasing = TRUE)
term_frequency[1:15]

freq_bigrams <- findFreqTerms(bigram_dtm, 25)
freq_bigrams}

SAMPLE of DATASET:

> dput(droplevels(head(files,4)))

structure(list(Score = c(10L, 10L, 10L, 7L), Comments = structure(c(4L,
3L, 1L, 2L), .Label = c("They are nice an quick. 3 years with them, and no issue.",
"Staff not very friendly.",
"I have to called them 3 times. They are very slow.",
"Quick and easy. High value."
), class = "factor")), row.names = c(NA, 4L), class = "data.frame")

Solution

  • Q1: There are situations where you can end up with fewer terms than documents.

    First, you are using VectorSource, so every element of your character vector counts as one document. That is not necessarily representative of the number of real documents: an element containing only a space still counts as a document. Second, you are removing stopwords. If your text contains many of them, a lot of words will disappear. Finally, TermDocumentMatrix by default drops all words shorter than 3 characters, so any short words left after removing stopwords are dropped as well. You can change this with the wordLengths option when creating a TermDocumentMatrix / DocumentTermMatrix.

    # keep words of length 1 and up; the default wordLengths is c(3, Inf)
    TermDocumentMatrix(corpus, control = list(wordLengths = c(1, Inf)))
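
To see how much empty elements inflate the document count, one option is to count them before building the corpus. This is a minimal sketch in base R; `cleaned` is a made-up stand-in for the cleaned comment vector, not the original data.

```r
# Hypothetical stand-in for the cleaned comments (file_cleaned in the question).
cleaned <- c("quick and easy high value", " ", "staff not very friendly", "")

# An element is "empty" if nothing is left after trimming whitespace.
is_empty <- !nzchar(trimws(cleaned))

n_documents <- length(cleaned)  # what VectorSource would count as documents
n_empty     <- sum(is_empty)    # documents that contribute no terms at all

n_documents
n_empty

# Dropping the empty elements before building the corpus is one way
# to get a more representative document count.
cleaned_nonempty <- cleaned[!is_empty]
```

If a large share of the elements turns out to be empty, the gap between the number of terms and the number of documents is less surprising.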
    

    Q2: Without a sample document, this is a bit of a guess.

    This is likely caused by a combination of the functions replace_number, replace_contraction, replace_abbreviation, removePunctuation and stripWhitespace. They change the text, so a term in the TDM may not appear in exactly that form in the raw data, which makes it hard to find with a plain search. Your best bet is to look for every word starting with "prob". As far as I can see, "proble" is not a correct stem. Also, qdap and tm don't do any stemming unless you ask for it.
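
The search for words starting with "prob" could be sketched as follows, using only base R. The `cleaned` vector is invented for illustration; in practice it would be the text after all cleaning steps, since that is what the tokenizer actually sees.

```r
# Hypothetical cleaned comments; replace with the real cleaned vector.
cleaned <- c("problem with missing payment",
             "proble miss statement",
             "no probs at all",
             "quick and easy")

# Split every document on whitespace and pool the tokens.
tokens <- unlist(strsplit(cleaned, "\\s+"))

# Frequency table of all tokens that start with "prob".
prob_words <- sort(table(tokens[grepl("^prob", tokens)]), decreasing = TRUE)
prob_words

# Indices of the documents that contain such a token, for manual inspection.
hits <- which(grepl("(^|\\s)prob", cleaned))
cleaned[hits]
```

Running the same search after each cleaning step (rather than only at the end) would show which function first introduces the unexpected form.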

    You also have a mistake in your custom_stops. All the built-in stopwords are in lowercase, and you have converted your text to lowercase, so your custom_stops should be in lowercase too: "bank" instead of "Bank".
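
A small sketch of why the case matters, assuming the tm package is installed. The sentence is made up for illustration; removeWords matches case-sensitively, so "Bank" never matches text that has already been lowercased.

```r
library(tm)

# Made-up example sentence, lowercased as in the question's pipeline.
text <- tolower("The Bank was quick and the bank staff were friendly")

# "Bank" does not occur in the lowercased text, so nothing is removed.
with_upper <- removeWords(text, c("Bank"))

# "bank" matches, so both occurrences are removed.
with_lower <- removeWords(text, c("bank"))

identical(with_upper, text)
grepl("bank", with_lower)
```

Note that removeWords leaves the surrounding spaces in place, so stripWhitespace is usually applied after it rather than before.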