unnest_tokens() error: "Please supply column name"


I am working with tidytext. When I call unnest_tokens(), R returns the error:

Please supply column name

How can I solve this error?

library(tidytext)
library(tm)
library(dplyr)
library(stats)
library(base)
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
# Build a corpus: a collection of statements
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
f <- Corpus(DirSource("C:/Users/Boon/Desktop/Dissertation/F"))
doc_dir <- "C:/Users/Boon/Desktop/Dis/F/f.csv"
doc <- read.csv(doc_dir, header = TRUE)
docs <- Corpus(DataframeSource(doc))
dtm <- DocumentTermMatrix(docs)
text_df <- data_frame(line = 1:115, docs = docs)

# This is the output from the code above, which is fine:
# text_df
# A tibble: 115 x 2
#line          docs
#<int> <S3: VCorpus>
# 1      1 <S3: VCorpus>
#2      2 <S3: VCorpus>
#3      3 <S3: VCorpus>
#4      4 <S3: VCorpus>
#5      5 <S3: VCorpus>
#6      6 <S3: VCorpus>
#7      7 <S3: VCorpus>
#8      8 <S3: VCorpus>
#9      9 <S3: VCorpus>
#10    10 <S3: VCorpus>
# ... with 105 more rows

unnest_tokens(word, docs)

# Error: Please supply column name

Solution

  • If you want to convert your text data to a tidy format, you do not need to transform it to a corpus or a document term matrix or anything first. That is one of the main ideas behind using a tidy data format for text; you don't use those other formats, unless you need to for modeling.

    You just put the raw text into a data frame, then use unnest_tokens() to tidy it. (I am making some assumptions here about what your CSV looks like; it would be more helpful to post a reproducible example next time.)

    library(dplyr)
    
    # tibble() replaces data_frame(), which is deprecated
    docs <- tibble(line = 1:4,
                   document = c("This is an excellent document.",
                                "Wow, what a great set of words!",
                                "Once upon a time...",
                                "Happy birthday!"))
    
    docs
    #> # A tibble: 4 x 2
    #>    line                        document
    #>   <int>                           <chr>
    #> 1     1  This is an excellent document.
    #> 2     2 Wow, what a great set of words!
    #> 3     3             Once upon a time...
    #> 4     4                 Happy birthday!
    
    library(tidytext)
    
    docs %>%
        unnest_tokens(word, document)
    #> # A tibble: 18 x 2
    #>     line      word
    #>    <int>     <chr>
    #>  1     1      this
    #>  2     1        is
    #>  3     1        an
    #>  4     1 excellent
    #>  5     1  document
    #>  6     2       wow
    #>  7     2      what
    #>  8     2         a
    #>  9     2     great
    #> 10     2       set
    #> 11     2        of
    #> 12     2     words
    #> 13     3      once
    #> 14     3      upon
    #> 15     3         a
    #> 16     3      time
    #> 17     4     happy
    #> 18     4  birthday
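
The same approach can be applied to text read straight from a CSV file. The following is only a minimal sketch: the question does not show what the CSV looks like, so the column name `text` and the temporary file standing in for the real CSV are assumptions made for illustration. Note that `unnest_tokens()` takes the data frame as its first argument (supplied here by the pipe); the call in the question passes only the two column names.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for the CSV in the question: one row per document,
# with the raw text in a column assumed to be named `text`.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(text = c("This is an excellent document.",
                      "Happy birthday!")),
  csv_path,
  row.names = FALSE
)

# Read the text as character rather than factor
# (factors were the default before R 4.0).
doc <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

# Add a line number, then tokenize: the data frame comes first (via the pipe),
# followed by the output column name and the input column name.
tidy_doc <- doc %>%
  as_tibble() %>%
  mutate(line = row_number()) %>%
  unnest_tokens(word, text)

tidy_doc
```

With a real file, `csv_path` would be the path to that file and `text` would be replaced by whatever the text column is actually called. No corpus or document-term matrix is needed at any point.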