I tried using quanteda to extract the top features, but the results are modified words, e.g. 'faulti' instead of 'faulty'. Is this the expected result?
I searched for the top-feature keywords in the original dataset and, as expected, found no matches.
Edit: if I set stem = FALSE in dfm(), the keywords go back to normal words.
library(quanteda)
corpus1 = corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')
#[text25701, 4] Convertible roof sometime | faulty | . SD card missing.
#[text25701, 22] unavailable). Pilot lamp | faulty | .
dfm1 <- dfm(
corpus1,
ngrams = 1,
remove = stopwords("english"),
remove_punct = TRUE,
remove_numbers = TRUE,
stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#faulti malfunct light damag miss cover rear loos lamp plate
# 562 523 454 337 331 325 295 259 250 238
library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495
dfm() doesn't stem by default. But you set the stem option to TRUE, hence "faulti". As you mention in your edit, setting it to FALSE (or omitting it) returns unstemmed words.
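To see the stemming on its own, here is a minimal sketch that runs a Snowball stemmer directly on a few words. It assumes the SnowballC package is installed and that quanteda's stemming behaves like SnowballC's English stemmer, which I believe is the case but have not verified for your version. The word list is just an illustration picked to resemble your top features.

```r
library(SnowballC)

# Illustrative words only (not taken from the original data set)
words <- c("faulty", "damaged", "missing", "malfunction")

# Apply the English Snowball stemmer to each word
stems <- wordStem(words, language = "english")

# Show each original word next to its stem
print(setNames(stems, words))
```

"faulty" comes out as "faulti", which is why the stemmed feature never matches the raw text.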
But it looks like you are misinterpreting what str_detect returns and what topfeatures returns. str_detect only detects whether the search string is present in each element, not how many times it occurs. Your sum only counts the sentences in which the word is present (495). topfeatures counts the number of times a word actually appears in the text (562).
Look at the following examples to see the difference:
# 1 line of text (paragraph)
my_text <- "I have two examples of two words in this text. Isn't having two words fun?"
topfeatures(dfm(my_text, remove = stopwords("english"), remove_punct = TRUE), n = 2)
two words
3 2
sum(str_detect(my_text, "two"))
[1] 1
# 2 sentences.
my_text2 <- c("I have two examples of two words in this text.", "Isn't having two words fun?")
topfeatures(dfm(my_text2, remove = stopwords("english"), remove_punct = TRUE), n = 2)
two words
3 2
sum(str_detect(my_text2, "two"))
[1] 2
For the first example, topfeatures returns 3 for the word "two", while str_detect returns just 1: there is only one element in the vector for str_detect to look in.
For the second example, topfeatures again returns 3 for the word "two". str_detect now returns 2: there are 2 elements in the vector and it detects "two" in both, but that is still short of the actual count of 3.
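If you want a stringr count that is comparable to topfeatures, str_count is probably closer to what you are after, since it counts every match per element instead of returning TRUE/FALSE. A minimal sketch on the second example (the "\\b" word boundaries are my addition, so that "two" does not also match inside longer words):

```r
library(stringr)

my_text2 <- c("I have two examples of two words in this text.",
              "Isn't having two words fun?")

# Number of elements that contain "two" at least once
docs_with_two <- sum(str_detect(my_text2, "two"))            # 2

# Total number of times "two" occurs across all elements
occurrences_of_two <- sum(str_count(my_text2, "\\btwo\\b"))  # 3
```

On your own data the totals may still differ from the stemmed dfm, because a stem such as "faulti" can collect more than one word form.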