I have ~1.4 million documents with a character count per document of 250 (median) and 470 (mean).
I want to perform spell checking and stemming before classifying them.
Simulated document:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 6) %>%
paste(collapse = " ")
nchar(sentence)
[1] 407
A function to perform spell checking first and then stemming:
library(hunspell)
library(magrittr)
spellAndStem <- function(sent, language = "en_US") {
  words <- sent %>%
    strsplit(split = " ") %>%
    unlist()

  # spelling: replace every word that fails the check with the first suggestion
  correct <- hunspell_check(
    words = words,
    dict = dictionary(language)
  )
  words[!correct] %<>%
    hunspell_suggest(dict = language) %>%
    sapply(FUN = "[", 1)

  # stemming
  words %>%
    hunspell_stem(dict = dictionary(language)) %>%
    unlist() %>%
    paste(collapse = " ")
}
I looked into the hunspell() function to pass the document over as a whole for performance gains, but I don't see how I could do spell checking and then stemming in that sequence.
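For reference, hunspell() can parse the whole text in one call, but it only returns the misspelled words per document, so correcting and stemming would still have to happen word by word afterwards (a minimal sketch):
bad_words <- hunspell(sentence, dict = dictionary("en_US"))
bad_words[[1]]  # the misspelled words found in the (single) document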
Time measurement:
> library(microbenchmark)
> microbenchmark(spellAndStem(sentence), times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
spellAndStem(sentence) 680.3601 689.8842 700.7957 694.3781 702.7493 798.9544 100
With 0.7s per document, it would require 0.7*1400000/3600/24 = 11.3 days to do the calculation.
Question:
How can I optimize this performance?
Final remark:
The target language is 98% German and 2% English. Not sure if this info matters; just mentioning it for completeness.
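Also for completeness: the language argument of the function would then have to point to a German dictionary, which hunspell only finds if one is installed on the system (a sketch, assuming a de_DE dictionary is available; hunspell itself only ships with en_US and en_GB):
hunspell::list_dictionaries()             # lists the dictionaries hunspell can see
dict_de <- hunspell::dictionary("de_DE")  # errors if no de_DE dictionary is installed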
You can substantially optimize your code by performing the expensive steps on the vocabulary instead of on all words in the document. The quanteda package offers a really useful object class for this called tokens:
toks <- quanteda::tokens(sentence)
unclass(toks)
#> $text1
#> [1] 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10 11
#> [26] 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9 10
#> [51] 11 12 1 2 3 4 5 4 6 7 8 9 10 11 12 1 2 3 4 5 4 6 7 8 9
#> [76] 10 11 12
#>
#> attr(,"types")
#> [1] "We" "aree" "drivng" "as" "fast" "we"
#> [7] "drove" "yestrday" "or" "evven" "fastter" "zysxzw"
#> attr(,"padding")
#> [1] FALSE
#> attr(,"what")
#> [1] "word"
#> attr(,"ngrams")
#> [1] 1
#> attr(,"skip")
#> [1] 0
#> attr(,"concatenator")
#> [1] "_"
#> attr(,"docvars")
#> data frame with 0 columns and 1 row
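The integer positions index into the types vector, so the original word sequence can be recovered from the two parts (a small illustration using only the object shown above; as.character() is the regular accessor):
types_vec <- attr(unclass(toks), "types")
head(types_vec[unclass(toks)$text1])  # rebuild the word sequence from positions + types
head(as.character(toks))              # the standard way to get the tokens back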
As you can see, the text is split into the vocabulary (types) and the positions of the words. We can use this to optimize your code by performing all steps on the types instead of on the entire text:
spellAndStem_tokens <- function(sent, language = "en_US") {
  sent_t <- quanteda::tokens(sent)

  # extract types to only work on them
  types <- quanteda::types(sent_t)

  # spelling: check each type once and replace misspelled types with the first suggestion
  correct <- hunspell_check(
    words = as.character(types),
    dict = hunspell::dictionary(language)
  )
  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
  types <- stringi::stri_replace_all_fixed(
    types,
    pattern,
    replacement,
    vectorize_all = FALSE
  )

  # stemming
  types <- hunspell_stem(types, dict = dictionary(language))

  # replace the original tokens with their corrected and stemmed types
  sent_t_new <- quanteda::tokens_replace(
    sent_t,
    quanteda::types(sent_t),
    as.character(types)
  )

  # drop tokens for which hunspell had no suggestion (they end up as "NULL")
  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")

  paste(as.character(sent_t_new), collapse = " ")
}
I'm using the bench package for the benchmarking, as it also checks whether the results of the two functions are identical and I find it more convenient in general:
res <- bench::mark(
spellAndStem(sentence),
spellAndStem_tokens(sentence)
)
res
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spellAndStem(sentence) 807ms 807ms 1.24 259KB 0
#> 2 spellAndStem_tokens(sentence) 148ms 150ms 6.61 289KB 0
summary(res, relative = TRUE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spellAndStem(sentence) 5.44 5.37 1 1 NaN
#> 2 spellAndStem_tokens(sentence) 1 1 5.33 1.11 NaN
The new function is 5.44 times faster than the original one. Note, though, that the difference becomes even more pronounced the larger the input text is:
sentence <- "We aree drivng as fast as we drove yestrday or evven fastter zysxzw" %>%
rep(times = 600) %>%
paste(collapse = " ")
res_big <- bench::mark(
spellAndStem(sentence),
spellAndStem_tokens(sentence)
)
res_big
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 spellAndStem(sentence) 1.27m 1.27m 0.0131 749.81KB 0
#> 2 spellAndStem_tokens(sentence) 178.26ms 182.12ms 5.51 1.94MB 0
summary(res_big, relative = TRUE)
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 spellAndStem(sentence) 428. 419. 1 1 NaN
#> 2 spellAndStem_tokens(sentence) 1 1 420. 2.65 NaN
As you can see, the time it takes to process the 100x bigger sample is almost the same as for the smaller one. This is because the vocabulary of the two is exactly the same. We can extrapolate from this result to your entire dataset by assuming the bigger sample represents 100 of your documents: the 1,400,000 documents then correspond to 14,000 such batches, so the function should take less than an hour (0.17826 * 14000 / 3600 = 0.69 hours). The calculation is really imperfect, though, as the actual runtime on your real data will depend almost solely on the size of the vocabulary.
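In R, that back-of-the-envelope estimate is simply (numbers taken from the benchmark above; the real figure depends entirely on your actual vocabulary):
secs_per_100_docs <- 0.17826               # median runtime for the 100-document sample
n_docs <- 1400000
secs_per_100_docs * n_docs / 100 / 3600    # estimated total runtime in hours, ~0.69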
Besides the programming/performance aspect, I have a few more concerns that might not be applicable in your specific case:
- In the last step, you will probably want to use sapply(as.list(sent_t_new), paste, collapse = " ") instead, as this will not collapse all documents into one long string but keep them separate (see the sketch after this list).
- The function currently discards words for which hunspell could not find any suggestions. I copied this approach (see the tokens_remove command), but you might want to think about at least outputting the discarded words instead of removing them silently.
- Since stemming rarely improves results in German, you might want to look into lemmatization instead (e.g., with spacyr) or simply turn it off.
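To make the first point concrete, here is a hedged sketch of how the approach could be applied to a whole vector of documents at once (docs and spellAndStem_docs are hypothetical names; the body only swaps the final paste() for the sapply() version mentioned above):
# hypothetical example data: one element per document
docs <- rep(sentence, 3)

spellAndStem_docs <- function(docs, language = "en_US") {
  sent_t <- quanteda::tokens(docs)          # tokens() handles many documents at once
  types <- quanteda::types(sent_t)
  correct <- hunspell_check(as.character(types), dict = hunspell::dictionary(language))
  pattern <- types[!correct]
  replacement <- sapply(hunspell_suggest(pattern, dict = language), FUN = "[", 1)
  types <- stringi::stri_replace_all_fixed(types, pattern, replacement, vectorize_all = FALSE)
  types <- hunspell_stem(types, dict = hunspell::dictionary(language))
  sent_t_new <- quanteda::tokens_replace(sent_t, quanteda::types(sent_t), as.character(types))
  sent_t_new <- quanteda::tokens_remove(sent_t_new, pattern = "NULL", valuetype = "fixed")
  # one corrected and stemmed string per input document
  sapply(as.list(sent_t_new), paste, collapse = " ")
}

out <- spellAndStem_docs(docs)
length(out)  # should equal length(docs)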