How can I tokenize a text column in R? unnest function not working


I am a new R user and would really appreciate help with a tokenization problem.

My task in brief: I am trying to import a text file into R. One of the text columns is Headline. The dataset is a collection of news articles related to a disease.

Issue: I have tried many times to tokenize the Headline column using the unnest_tokens function.

It gives the following error messages:

Error in UseMethod("unnest_tokens_") : no applicable method for 'unnest_tokens_' applied to an object of class "character"

Error in unnest_tokens(word, Headline) : object 'word' not found

library(dplyr)
library(tidytext)

DengueNews %>%
unnest_tokens(word, Headline)

Note: Link to the dataset: https://drive.google.com/file/d/18VWg-2sO11GpwxMGF1UbziodoWK9B9Ru/view?usp=sharing I am following the instructions from https://www.tidytextmining.com/tidytext.html


Solution

  • It is not clear how the data was read. As mentioned in the comments, if the 'Headline' column is of class character, unnest_tokens should work. Here, we use read_excel from the readxl package to read the dataset. By default, read_excel returns text columns as character vectors.

    library(readxl)
    library(dplyr)     # provides the %>% pipe
    library(tidytext)
    DengueNews <- read_excel("DengueNews.xlsx")
    class(DengueNews$Headline)
    #[1] "character"
    
    DengueNews %>%
      unnest_tokens(word, Headline)
    # A tibble: 217 x 4
       Serial Date  Newscontent                                                                                                                                             word      
        <dbl> <chr> <chr>                                                                                                                                                   <chr>     
     1    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dghs      
     2    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 491       
     3    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… more      
     4    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… hospitali…
     5    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… for       
     6    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… dengue    
     7    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… in        
     8    216 43727 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA total of 491 dengue patients have been admitted to different hospitals acro… 24hrs     
     9    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… 1         
    10    215 43725 "The unofficial death toll is reported to be over 157, so far\r\n\r\n\r\nA fifth-grader schoolgirl has died of dengue fever at Dhaka Medical College a… more      
    # … with 207 more rows
    

    If we change the column to another class, such as factor, it would fail:

    library(dplyr)
    DengueNews %>%
       mutate(Headline = factor(Headline)) %>%
       unnest_tokens(word, Headline)
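
    A minimal sketch of how to check for and undo this before tokenizing. The data frame dengue_demo and its two headlines are made up for illustration and are not taken from the dataset. The first error message in the question reports an object of class "character", which suggests that a character vector rather than a data frame was passed to unnest_tokens, so the sketch checks that as well.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the imported data (not the real dataset).
# Headline is a factor here, which is what read.csv() produced by
# default before R 4.0.0.
dengue_demo <- data.frame(
  Serial = 1:2,
  Headline = factor(c("Dengue cases rise in Dhaka",
                      "Hospitals admit more patients"))
)

# The object passed to unnest_tokens() should be a data frame,
# and the text column should be of class "character".
is.data.frame(dengue_demo)
class(dengue_demo$Headline)

# Convert the factor back to character, then tokenize.
tokens <- dengue_demo %>%
  mutate(Headline = as.character(Headline)) %>%
  unnest_tokens(word, Headline)

tokens$word
```

    If is.data.frame() returns FALSE for the imported object, the import step is the likely culprit rather than the column class, and the object would need to be read again or wrapped in a data frame first.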