I'm doing NLP with the tidymodels framework, using the textrecipes package, which provides recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. Now I want to spell-check the new tokenized variable with a custom function built on the hunspell package (link to the spell check blog post), but I get the following error:
Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE
Apparently, tokenlists don't convert easily to character vectors. I've noticed step_untokenize, but it simply dissolves the tokenlist by pasting and collapsing the tokens, and that's not what I need.
library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)
library(tidytext) # provides unnest_tokens(), used in the last snippet
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
        # get first suggestion, or NA if suggestions list is empty
        map(1, .default = NA) %>%
        unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}
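# Attempt: apply the spell checker directly to the tokenlist column
product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_mutate(desc = correct_spelling(desc))
# prep() is where the error shown above is raised
product_recipe %>% prep()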
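# For comparison: one token per row in a plain character column
# (unnest_tokens() comes from the tidytext package)
product_descriptions %>%
  unnest_tokens(word, desc) %>%
  mutate(word = correct_spelling(word))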
There isn't a canonical way to do this using {textrecipes} yet. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (you provided that), and a way to apply that function to each element of the tokenlist. For now, there isn't a general step that lets you do that, but you can work around it by passing the function to the custom_stemmer argument of step_stem(), which gives you the results you want:
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from
#>   required_pkgs.model_spec parsnip
library(textrecipes)
library(hunspell)
product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)
correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
        # get first suggestion, or NA if suggestions list is empty
        map(1, .default = NA) %>%
        unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}
product_recipe <- recipe(desc ~ price, data = product_descriptions) %>%
  step_tokenize(desc) %>%
  step_stem(desc, custom_stemmer = correct_spelling) %>%
  step_tf(desc)
product_recipe %>%
  prep() %>%
  bake(new_data = NULL)
#> # A tibble: 3 × 6
#>   price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
#>   <dbl>       <dbl>        <dbl>       <dbl>           <dbl>       <dbl>
#> 1  1000           0            1           0               1           0
#> 2   700           0            1           1               0           1
#> 3   250           1            0           0               1           0
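To make the "apply that function to each element" idea concrete, here is a minimal base R sketch. It uses a plain list of character vectors as a stand-in for a tokenlist (a real tokenlist is its own textrecipes class), and `toy_correct()` with its `corrections` lookup table are hypothetical placeholders for `correct_spelling()` and hunspell's suggestions, so the corrected words are illustrative only.

```r
# Stand-in for a tokenlist: one character vector of tokens per document.
# (Assumption: a real tokenlist is a dedicated class, not a plain list.)
tokens <- list(
  c("goood", "product"),
  c("not", "sou", "good"),
  c("vad", "produkt")
)

# Hypothetical lookup table standing in for hunspell's suggestions.
corrections <- c(goood = "good", vad = "bad", produkt = "product")

# Hypothetical vectorised corrector: character vector in, character vector
# of the same length out.
toy_correct <- function(words) {
  fixed <- unname(corrections[words]) # NA where no correction is known
  ifelse(is.na(fixed), words, fixed)
}

# The container as a whole is not a character vector, so a function that
# asserts is.character() on its input rejects it.
is.character(tokens)
#> [1] FALSE

# Applying the function element by element keeps one vector per document.
corrected <- lapply(tokens, toy_correct)
corrected[[3]]
#> [1] "bad"     "product"
```

The point of the sketch is the shape of the function: as long as it maps a character vector of tokens to a character vector of the same length, it can be applied per document, which is presumably why passing it as `custom_stemmer` works.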