
r quanteda top features extraction returning modified words


I tried using quanteda to extract the top features, but the results are modified words, e.g. 'faulti' instead of 'faulty'. Is this the expected result?

I also searched the original dataset for the top-feature keywords and found no matches for them.

Edit: if I set stem = FALSE in dfm(), the keywords come back as normal words.

library(quanteda)    
corpus1 = corpus(as.character(training_data$Elec_rmk))
kwic(corpus1, 'faulty')

#[text25701, 4]              Convertible roof sometime | faulty | . SD card missing.               
#[text25701, 22]              unavailable). Pilot lamp | faulty | .  

dfm1 <- dfm(
  corpus1, 
  ngrams = 1, 
  remove = stopwords("english"),
  remove_punct = TRUE,
  remove_numbers = TRUE,
  stem = TRUE)
tf1 <- topfeatures(dfm1, n = 10)
tf1
# key words were modified/truncated words?
#faulti malfunct    light    damag     miss    cover     rear     loos     lamp    plate 
#   562      523      454      337      331      325      295      259      250      238 

library(stringr)
sum(str_detect(training_data$Elec_rmk, 'faulti')) # 0
sum(str_detect(training_data$Elec_rmk, 'faulty')) # 495

Solution

  • dfm() doesn't stem by default, but you set stem = TRUE, hence "faulti". As you mention in your edit, setting it to FALSE (or omitting it) returns unstemmed words.
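
    To see the stemmer in isolation, here is a minimal sketch, assuming quanteda's char_wordstem() helper is available in your installed version. The word list is hypothetical, chosen to resemble terms in the question; it is not taken from the real data.

    ```r
    library(quanteda)

    # Hypothetical sample words, similar to those in the question's remarks
    words <- c("faulty", "damaged", "missing", "loose")

    # Stem each word on its own, without building a dfm
    stems <- char_wordstem(words)

    # Show each original word next to its stem
    setNames(stems, words)
    ```

    "faulty" should come back as "faulti", the same form topfeatures reported in the question.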

    It also looks like you are misinterpreting what str_detect returns versus what topfeatures returns. str_detect only reports whether the search string is present in each element, not how many times it occurs, so your sum counts the remarks that contain the word (495). topfeatures counts how many times the word actually appears in the text (562).

    Look at the following examples to see the difference:

    # 1 line of text (paragraph)
    my_text <- "I have two examples of two words in this text. Isn't having two words fun?"
    
    topfeatures(dfm(my_text, remove = stopwords("english"), remove_punct = TRUE), n = 2)
      two words 
        3     2 
    sum(str_detect(my_text, "two"))
    [1] 1
    
    # 2 sentences.
    my_text2 <- c("I have two examples of two words in this text.", "Isn't having two words fun?")
    
    topfeatures(dfm(my_text2, remove = stopwords("english"), remove_punct = TRUE), n = 2)
      two words 
        3     2 
    sum(str_detect(my_text2, "two"))
    [1] 2
    

    For the first example, topfeatures returns 3 for the word "two", while str_detect returns just 1: there is only one element (piece of text) for str_detect to look in.

    For the second example, topfeatures again returns 3 for the word "two". str_detect now returns 2: the vector has two elements and it detects "two" in both, but that is still short of the actual count of 3.
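
    If you want a stringr count that is comparable to topfeatures, one option is str_count, which counts matches per element instead of only testing for presence. A minimal sketch on the second example; the \\b word boundaries are my own addition so that "two" is not matched inside longer words.

    ```r
    library(stringr)

    my_text2 <- c("I have two examples of two words in this text.",
                  "Isn't having two words fun?")

    # Number of matches in each element of the vector
    per_element <- str_count(my_text2, "\\btwo\\b")

    # Total number of occurrences across all elements
    total <- sum(per_element)
    total
    ```

    Here the total is 3, which agrees with topfeatures. Note that str_count is case-sensitive, while dfm() lowercases tokens by default, so the two counts can still differ on real data. With stem = TRUE, the count for a stem such as "faulti" may also include other word forms that reduce to the same stem.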