r format tokenize tidytext

Tokenizing text from a .txt file with tidytext


I'm trying to use tidytext on the contents of a .txt file, which I loaded as texto_revision. It has the following structure:

# A tibble: 254 x 230
   X1     X2     X3     X4    X5    X6    X7    X8    X9    X10   X11   X12   X13   X14   X15   X16  
   <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 la     expro~ de     la    tier~ ocur~ con   frec~ dura~ el    proc~ rapi~ de    la    urba~ en   
 2 como   las    difer~ en    el    moti~ del   cons~ cons~ en    esta~ unid~ y     china afec~ la   
 3 las    desig~ etnic~ en    los   patr~ de    cons~ (pre~ de    vest~ joye~ auto~ han   sido  obje~
 4 este   artic~ exami~ el    impa~ de    vari~ dife~ indi~ en    la    prop~ de    los   cons~ a    
 5 este   artic~ inves~ la    infl~ de    los   regi~ poli~ sobre la    impo~ 
 #   ...

When I try to use unnest_tokens with the following code:

library(tidytext)

texto_revision %>%
    unnest_tokens(word, text)

I get the following error:

Error: Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.

To work around the error and continue with the tokenization, I tried converting the text to a data frame with the following code:

text_df <- as.data.frame(texto_revision)

but I still get the same error:

Error in check_input(x) : Input must be a character vector of any length or a list of character vectors, each of which has a length of 1.


Solution

  • Note that the syntax for unnest_tokens is "unnest_tokens([new column name], [reference column])". There appears to be no "text" column in your tibble/data frame; its columns are named X1, X2, and so on. Below is a toy example to illustrate:

    library(dplyr)     # provides %>%
    library(tidytext)  # provides unnest_tokens()

    # One character column holding the text to tokenize
    State <- c("SC is in the South", "NC is in the south",
               "NY is in the north")
    DF <- data.frame(State, stringsAsFactors = FALSE)
    
    > DF
                   State
     1 SC is in the South
     2 NC is in the south
     .....
    > DF %>% unnest_tokens(word, State)
    
         word
    1      sc
    1.1    is
    1.2    in
    1.3   the
    ....
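
    To apply the same idea to a wide tibble like the one in the question, one possible approach is to paste the X columns of each row into a single text column first, and then tokenize that column. The following is only a sketch: the small texto_revision built here is a hypothetical stand-in for the real 254 x 230 tibble, and the column names doc and text are assumptions chosen for illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)

# Hypothetical stand-in for texto_revision: one word per cell,
# with NA padding where a row has fewer words than columns.
texto_revision <- tibble(
  X1 = c("este", "como"),
  X2 = c("texto", "las"),
  X3 = c("corto", NA)
)

tokens <- texto_revision %>%
  # Keep track of which original row each token came from
  mutate(doc = row_number()) %>%
  # Collapse all X columns into one character column, skipping NA cells
  unite("text", starts_with("X"), sep = " ", na.rm = TRUE) %>%
  # Tokenize the new column: output column "word", input column "text"
  unnest_tokens(word, text)

print(tokens)
```

    The key point is the same as in the toy example: the second argument of unnest_tokens must name a character column that actually exists in the data. The na.rm argument of unite requires a reasonably recent version of tidyr.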