tidytext: Issue with unnest_tokens and token = 'ngrams'


I'm running the following code:

library(rwhatsapp)
library(tidytext)
library(dplyr)  # provides %>% and as_tibble()

chat <- rwa_read(x = c(
  "31/1/15 04:10:59 - Menganito: Was it good?",
  "31/1/15 14:10:59 - Fulanito: Yes, it was"
))

chat %>% as_tibble() %>% 
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

But I'm getting the following error:

Error in unnest_tokens.data.frame(., output = bigram, input = text, token = "ngrams",  : 
  If collapse = TRUE (such as for unnesting by sentence or paragraph), unnest_tokens needs all input columns to be atomic vectors (not lists)

I searched on Google but couldn't find an answer. The text column is a character vector, so I don't understand why the error says it isn't.


Solution

  • The error occurs because chat contains list columns (emoji and emoji_name, marked ### below), and here all of their elements are NULL:

    str(chat)
    #tibble [2 × 6] (S3: tbl_df/tbl/data.frame)
    # $ time      : POSIXct[1:2], format: "2015-01-31 04:10:59" "2015-01-31 14:10:59"
    # $ author    : Factor w/ 2 levels "Fulanito","Menganito": 2 1
    # $ text      : chr [1:2] "Was it good?" "Yes, it was"
    # $ source    : chr [1:2] "text input" "text input"
    # $ emoji     :List of 2   ###
    #  ..$ : NULL
    #  ..$ : NULL
    # $ emoji_name:List of 2    ###
    #  ..$ : NULL
    #  ..$ : NULL
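
    If you'd rather not read through the str() output, the list columns can also be found programmatically. This is a minimal sketch that uses a hypothetical stand-in tibble (chat_demo, with made-up values mirroring the structure above) instead of the real chat object, so it runs without rwhatsapp:

    ```r
    library(tibble)

    # Hypothetical stand-in for `chat`: two atomic columns and two list columns
    chat_demo <- tibble(
      text       = c("Was it good?", "Yes, it was"),
      source     = c("text input", "text input"),
      emoji      = list(NULL, NULL),
      emoji_name = list(NULL, NULL)
    )

    # TRUE for every column stored as a list rather than an atomic vector
    is_list_col <- vapply(chat_demo, is.list, logical(1))

    # Names of the columns that unnest_tokens() would complain about
    list_cols <- names(chat_demo)[is_list_col]
    print(list_cols)
    ```

    The same vapply() call should work on the real chat object.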
    

    If we drop the list columns, the call works:

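    library(rwhatsapp)
    library(tidytext)
    library(dplyr)  # provides %>% and select_if()

    # Drop every list column, then tokenize the text into bigrams
    chat %>%
      select_if(~ !is.list(.)) %>%
      unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)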
    # A tibble: 4 x 4
    #  time                author    source     bigram 
    #  <dttm>              <fct>     <chr>      <chr>  
    #1 2015-01-31 04:10:59 Menganito text input was it 
    #2 2015-01-31 04:10:59 Menganito text input it good
    #3 2015-01-31 14:10:59 Fulanito  text input yes it 
    #4 2015-01-31 14:10:59 Fulanito  text input it was 
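
    With newer dplyr, the same filtering can be written with select() and where() instead of select_if(). A minimal sketch, assuming dplyr 1.0.0 or later; chat_demo is a hypothetical stand-in with the same column types as chat, so the example runs without rwhatsapp:

    ```r
    library(dplyr)
    library(tibble)
    library(tidytext)

    # Hypothetical stand-in for `chat`, mirroring the column types shown by str()
    chat_demo <- tibble(
      author     = factor(c("Menganito", "Fulanito")),
      text       = c("Was it good?", "Yes, it was"),
      emoji      = list(NULL, NULL),
      emoji_name = list(NULL, NULL)
    )

    bigrams <- chat_demo %>%
      select(!where(is.list)) %>%  # keep only the atomic (non-list) columns
      unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

    bigrams$bigram
    ```

    This should produce the same four bigrams as the select_if() version above.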
    

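    Also, collapse = TRUE is the default here (as the error message indicates), and this causes a problem when there are NULL elements because the lengths no longer match once the text is collapsed. One option is to specify collapse = FALSE, which keeps the list columns. Note that later tidytext releases changed how collapse works, so check ?unnest_tokens for your installed version.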

    # Keep the list columns and tokenize each row separately
    chat %>%
      unnest_tokens(output = bigram, input = text, token = "ngrams",
                    n = 2, collapse = FALSE)
    # A tibble: 4 x 6
    #  time                author    source     emoji  emoji_name bigram 
    #  <dttm>              <fct>     <chr>      <list> <list>     <chr>  
    #1 2015-01-31 04:10:59 Menganito text input <NULL> <NULL>     was it 
    #2 2015-01-31 04:10:59 Menganito text input <NULL> <NULL>     it good
    #3 2015-01-31 14:10:59 Fulanito  text input <NULL> <NULL>     yes it 
    #4 2015-01-31 14:10:59 Fulanito  text input <NULL> <NULL>     it was