Tags: r, spell-checking, stemming

How do I optimize the performance of stemming and spell checking in R?


I have ~1.4 million documents, with a median of 250 and a mean of 470 characters per document.

I want to perform spell checking and stemming before classifying them.

Simulated document:

sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
    rep(times = 6) %>%
    paste(collapse = " ")

nchar(sentence)
[1] 407 

Function to perform spell check first and then stemming:

library(hunspell)
library(magrittr)

spellAndStem <- function(sent, language = "en_US"){
  words <- sent %>%
    strsplit(split = " ") %>%
    unlist

  # spelling
  correct <- hunspell_check(
        words = words, 
        dict = dictionary(language)
  )

  words[!correct] %<>%
    hunspell_suggest(dict = language) %>%
    sapply(FUN = "[", 1)

  # stemming
  words %>%
    hunspell_stem(dict = dictionary(language)) %>%
    unlist %>%
    paste(collapse = " ")
}

I looked into the hunspell() function to pass over the document as a whole for performance gains, but I don't see how I could do spell check and stemming in that sequence.
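
For reference, this is roughly what I mean: hunspell() checks the whole string at once, but it only returns the misspelled words, so I would still have to go back to word level for suggestions and stemming. A minimal sketch:

bad_words <- hunspell(sentence, dict = dictionary("en_US"))
bad_words[[1]]  # misspelled words found anywhere in the document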

Time measurement:

> library(microbenchmark)
> microbenchmark(spellAndStem(sentence), times = 100)
Unit: milliseconds
                   expr      min       lq     mean   median       uq      max neval
 spellAndStem(sentence) 680.3601 689.8842 700.7957 694.3781 702.7493 798.9544   100

With 0.7s per document, it would require 0.7*1400000/3600/24 = 11.3 days to do the calculation.

Question:

How can I optimize this performance?

Final remark:

The target language is 98% German and 2% English. Not sure if this info matters, just for completeness.


Solution

  • You can substantially optimize your code by performing the expensive steps on the vocabulary instead of on all words in the document. The quanteda package offers a really useful object class for this called tokens:

    toks <- quanteda::tokens(sentence)
    unclass(toks)
    #> $text1
    #>  [1]  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10 11
    #> [26] 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9 10
    #> [51] 11 12  1  2  3  4  5  4  6  7  8  9 10 11 12  1  2  3  4  5  4  6  7  8  9
    #> [76] 10 11 12
    #> 
    #> attr(,"types")
    #>  [1] "We"       "aree"     "drivng"   "as"       "fast"     "we"      
    #>  [7] "drove"    "yestrday" "or"       "evven"    "fastter"  "zysxzw"  
    #> attr(,"padding")
    #> [1] FALSE
    #> attr(,"what")
    #> [1] "word"
    #> attr(,"ngrams")
    #> [1] 1
    #> attr(,"skip")
    #> [1] 0
    #> attr(,"concatenator")
    #> [1] "_"
    #> attr(,"docvars")
    #> data frame with 0 columns and 1 row
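
    quanteda's ntype() and ntoken() helpers confirm the sizes involved for this toy example:

    quanteda::ntype(toks)    # 12 unique word types in the vocabulary
    quanteda::ntoken(toks)   # 78 token positions in the text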
    

    As you can see, the text is split into a vocabulary (types) and the positions of the words, and the vocabulary is much smaller than the text itself. We can use this to optimize your code by performing all steps on the types instead of on the entire text:

    spellAndStem_tokens <- function(sent, language = "en_US") {
    
      sent_t <- quanteda::tokens(sent)
    
      # extract types to only work on them
      types <- quanteda::types(sent_t)
    
      # spelling
      correct <- hunspell_check(
        words = as.character(types), 
        dict = hunspell::dictionary(language)
      )
    
      pattern <- types[!correct]
      replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
    
      types <- stringi::stri_replace_all_fixed(
        types,
        pattern, 
        replacement,
        vectorize_all = FALSE
      )
    
      # stemming
      types <- hunspell_stem(types, dict = hunspell::dictionary(language))
    
    
      # replace original tokens
      sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
    
      # drop words for which hunspell found no suggestion (see point 2 below)
      sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")
    
      paste(as.character(sent_t_new), collapse = " ")
    }
    

    I'm using the bench package for the benchmarking, as it also checks whether the results of the two functions are identical and I find it more convenient in general:

    res <- bench::mark(
      spellAndStem(sentence),
      spellAndStem_tokens(sentence)
    )
    
    res
    #> # A tibble: 2 x 6
    #>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 spellAndStem(sentence)           807ms    807ms      1.24     259KB        0
    #> 2 spellAndStem_tokens(sentence)    148ms    150ms      6.61     289KB        0
    
    summary(res, relative = TRUE)
    #> # A tibble: 2 x 6
    #>   expression                      min median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
    #> 1 spellAndStem(sentence)         5.44   5.37      1         1         NaN
    #> 2 spellAndStem_tokens(sentence)  1      1         5.33      1.11      NaN
    

    The new function is 5.44 times faster than the original one. Note, though, that the difference becomes even more pronounced the larger the input text is:

    sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
      rep(times = 600) %>%
      paste(collapse = " ")
    
    res_big <- bench::mark(
      spellAndStem(sentence),
      spellAndStem_tokens(sentence)
    )
    
    res_big
    #> # A tibble: 2 x 6
    #>   expression                         min   median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                    <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
    #> 1 spellAndStem(sentence)         1.27m    1.27m      0.0131  749.81KB        0
    #> 2 spellAndStem_tokens(sentence)  178.26ms 182.12ms   5.51      1.94MB        0
    summary(res_big, relative = TRUE)
    #> # A tibble: 2 x 6
    #>   expression                      min median `itr/sec` mem_alloc `gc/sec`
    #>   <bch:expr>                    <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
    #> 1 spellAndStem(sentence)         428.   419.        1       1         NaN
    #> 2 spellAndStem_tokens(sentence)   1      1       420.       2.65      NaN
    

    As you can see, the time it takes to process the 100x bigger sample is almost the same as for the smaller one. This is because the vocabulary of the two is exactly the same. We can extrapolate from this result to your entire dataset, assuming this bigger sample represents 100 of your documents: the function should then take less than an hour (0.17826 * 14000 / 3600 = 0.69 hours), but the calculation is really imperfect, as the actual runtime on your real data will depend almost entirely on the size of the vocabulary.
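
    Should that still be too slow, the documents are independent of each other, so the work parallelises naturally. A rough sketch with the base parallel package (documents here stands in for your character vector of texts and the core count is just a guess; on Windows you would use parLapply() instead of mclapply()):

    library(parallel)

    documents <- rep(sentence, 10)  # stand-in for your ~1.4m documents

    cleaned <- mclapply(
      documents,
      spellAndStem_tokens,
      mc.cores = max(1L, detectCores() - 1L)
    )
    cleaned <- unlist(cleaned)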

    Besides the programming/performance aspect, I have a few more concerns that might not be applicable in your specific case:

    1. I would suggest changing the last line in the function to sapply(as.list(sent_t_new), paste, collapse = " "), as this will not collapse all documents into one long string but keep them separate (see the sketch for point 1 below).
    2. Currently, your setup removes words for which hunspell could not find any suggestions. I copied this approach (see the tokens_remove command), but you might want to think about at least outputting the discarded words instead of removing them silently.
    3. If the function above is a preparation step for some other text analysis, it would make more sense to transform the data directly into a document-term matrix before stemming and spell checking are performed (see the sketch for point 3 below).
    4. Stemming is just an approximation of lemmatization, which is the process of actually finding the base form of a word. Additionally, stemming usually works quite poorly in German. Depending on what you are doing, you might want to do lemmatization instead (e.g., using spacyr; see the sketch for point 4 below) or simply turn it off, since stemming rarely improves results in German.
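
    Regarding point 1, a quick illustration of the difference with two toy documents (the document names are made up):

    toks2 <- quanteda::tokens(c(doc1 = "first text", doc2 = "second text"))

    # collapses everything into one long string
    paste(as.character(toks2), collapse = " ")

    # keeps one string per document
    sapply(as.list(toks2), paste, collapse = " ")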
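
    Regarding point 3, the idea is to build the document-feature matrix first and then run the spell check/stemming on its features, which are already the deduplicated vocabulary across all documents. A sketch (I am assuming dfm_replace() here, which mirrors tokens_replace() for dfm objects):

    library(quanteda)

    docs <- c(doc1 = sentence, doc2 = sentence)  # stand-in for your documents
    dfmat <- dfm(tokens(docs))

    feats <- featnames(dfmat)      # vocabulary across all documents
    # ... run hunspell_check()/hunspell_suggest()/hunspell_stem() on feats as above ...
    feats_clean <- feats           # placeholder for the corrected/stemmed features

    dfmat <- dfm_replace(dfmat, pattern = feats, replacement = feats_clean)
    dfmat <- dfm_compress(dfmat)   # merge features that became identical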
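
    Regarding point 4, a minimal lemmatization sketch with spacyr (assuming the German model de_core_news_sm is installed; the example text is made up):

    library(spacyr)

    # one-time setup: spacy_install(); spacy_download_langmodel("de_core_news_sm")
    spacy_initialize(model = "de_core_news_sm")

    parsed <- spacy_parse(
      c(doc1 = "Wir fuhren gestern schneller als erlaubt."),
      lemma = TRUE, pos = FALSE, entity = FALSE
    )
    parsed$lemma    # base forms, e.g. "fahren" instead of a truncated stem

    spacy_finalize()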