step_mutate with textrecipes tokenlists


I'm doing NLP with the tidymodels framework, using the textrecipes package, which provides recipe steps for text preprocessing. Here, step_tokenize takes a character vector as input and returns a tokenlist object. I want to spell-check the new tokenized variable with a custom spelling-correction function built on the hunspell package (link to the spell check blog post), but I get the following error:

Error: Problem with `mutate()` column `desc`.
i `desc = correct_spelling(desc)`.
x is.character(words) is not TRUE

Apparently, tokenlists can't easily be converted to character vectors. I've noticed that step_untokenize exists, but it simply dissolves the tokenlist by pasting and collapsing the tokens, and that's not what I need.

REPREX

library(tidyverse)
library(tidymodels)
library(textrecipes)
library(hunspell)

product_descriptions <- tibble(
  desc = c("goood product", "not sou good", "vad produkt"),
  price = c(1000, 700, 250)
)

correct_spelling <- function(input) {
  output <- case_when(
    # check and (if required) correct spelling
    !hunspell_check(input, dictionary('en_US')) ~
      hunspell_suggest(input, dictionary('en_US')) %>%
      # get first suggestion, or NA if suggestions list is empty
      map(1, .default = NA) %>%
      unlist(),
    TRUE ~ input # if word is correct
  )
  # if input incorrectly spelled but no suggestions, return input word
  ifelse(is.na(output), input, output)
}

product_recipe <- recipe(desc ~ price, data = product_descriptions) %>% 
  step_tokenize(desc) %>% 
  step_mutate(desc = correct_spelling(desc))

product_recipe %>% prep()

WHAT I WANT, BUT WITHOUT RECIPES

library(tidytext) # provides unnest_tokens(), which the reprex above does not load

product_descriptions %>% 
  unnest_tokens(word, desc) %>% 
  mutate(word = correct_spelling(word))

Solution

  • There isn't a canonical way to do this with {textrecipes} yet. We need two things: a function that takes a vector of tokens and returns spell-checked tokens (you provided that), and a way to apply that function to each element of the tokenlist. For now, there isn't a general step that lets you do that, but you can work around it by passing the function to the custom_stemmer argument of step_stem(), which gives you the results you want:

    library(tidyverse)
    library(tidymodels)
    #> Registered S3 method overwritten by 'tune':
    #>   method                   from   
    #>   required_pkgs.model_spec parsnip
    library(textrecipes)
    library(hunspell)
    
    product_descriptions <- tibble(
      desc = c("goood product", "not sou good", "vad produkt"),
      price = c(1000, 700, 250)
    )
    
    correct_spelling <- function(input) {
      output <- case_when(
        # check and (if required) correct spelling
        !hunspell_check(input, dictionary('en_US')) ~
          hunspell_suggest(input, dictionary('en_US')) %>%
          # get first suggestion, or NA if suggestions list is empty
          map(1, .default = NA) %>%
          unlist(),
        TRUE ~ input # if word is correct
      )
      # if input incorrectly spelled but no suggestions, return input word
      ifelse(is.na(output), input, output)
    }
    
    product_recipe <- recipe(desc ~ price, data = product_descriptions) %>% 
      step_tokenize(desc) %>% 
      step_stem(desc, custom_stemmer = correct_spelling) %>%
      step_tf(desc)
    
    product_recipe %>% 
      prep() %>%
      bake(new_data = NULL)
    #> # A tibble: 3 × 6
    #>   price tf_desc_cad tf_desc_good tf_desc_not tf_desc_product tf_desc_sou
    #>   <dbl>       <dbl>        <dbl>       <dbl>           <dbl>       <dbl>
    #> 1  1000           0            1           0               1           0
    #> 2   700           0            1           1               0           1
    #> 3   250           1            0           0               1           0
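
The "apply a function to each element" idea can also be sketched in base R, without textrecipes or hunspell. This is only an illustration under stated assumptions: the `corrections` lookup table, the `fix_tokens()` helper, and the plain list `token_lists` are all hypothetical stand-ins (for the spell checker, the custom function, and the tokenlist respectively), not part of either package.

```r
# Hypothetical stand-in for a real spell checker: a tiny lookup table
# mapping known misspellings to corrections (illustrative data only).
corrections <- c(goood = "good", produkt = "product")

# Takes a character vector of tokens and returns a vector of the same length.
fix_tokens <- function(tokens) {
  fixed <- corrections[tokens]                 # NA where no correction is known
  unname(ifelse(is.na(fixed), tokens, fixed))  # fall back to the original token
}

# A plain list of character vectors standing in for a tokenlist
token_lists <- list(
  c("goood", "product"),
  c("not", "sou", "good"),
  c("vad", "produkt")
)

# Apply the token-level function to each element separately
corrected <- lapply(token_lists, fix_tokens)
```

The property that matters here is that the function receives a character vector and returns a character vector of the same length, which is the shape of function that `correct_spelling()` already has.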