I read a csv file this way:
Here is the str() output:
$ an_id : int 4840 41981 40482 37473 33278 29083 30940 29374 24023 23922 ...
The an_id column is read as int, so I convert it to chr with the following:
df$an_id <- paste0("doc_", df$an_id)
However, when I run this command I get this error:
toks <- corpus(df, docid_field = "an_id") %>%
  tokens()
Error in corpus.data.frame(df, docid_field = "an_id") : column name text not found
Is there a different way to read the file, or to pass the column as text?
If I save the following data to a csv file, read that file back in, and run the same commands, they work properly:
dtext <- data.frame(id = c(1,2,3,4), text = c("here","This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.", "The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.", "There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."),stringsAsFactors = F)
As @Nathalie mentioned in the comments, the following does the trick when the data is in a data.frame. docid_field refers to the column that holds the document ids, and text_field should refer to the column that contains the text.
toks <- corpus(df,
               docid_field = "an_id",
               text_field = "text") %>%
  tokens()
str(toks)
List of 4
$ doc_1: chr "here"
$ doc_2: chr [1:39] "This" "dataset" "contains" "movie" ...
$ doc_3: chr [1:36] "The" "core" "dataset" "contains" ...
$ doc_4: chr [1:105] "There" "are" "two" "top-level" ...
- attr(*, "types")= chr [1:102] "here" "This" "dataset" "contains" ...
- attr(*, "padding")= logi FALSE
- attr(*, "class")= chr "tokens"
- attr(*, "what")= chr "word"
- attr(*, "ngrams")= int 1
- attr(*, "skip")= int 0
- attr(*, "concatenator")= chr "_"
- attr(*, "docvars")='data.frame': 4 obs. of 0 variables
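The error message in the question ("column name text not found") suggests that corpus() was looking for a column called text, which is the default value of text_field for a data.frame. As a minimal sketch, assuming the text in the real csv sits in a column with some other name (the reviews data frame and its review column below are hypothetical, not from the question), the column can be named explicitly:

```r
# Hypothetical data frame whose text column is called "review", not "text"
reviews <- data.frame(
  an_id  = paste0("doc_", 1:2),
  review = c("here", "This dataset contains movie reviews."),
  stringsAsFactors = FALSE
)

# corpus() on a data.frame looks for a column named "text" unless told otherwise,
# so check the column names first
has_text_col <- "text" %in% names(reviews)
print(has_text_col)

# Only run the quanteda part if the package is installed
if (requireNamespace("quanteda", quietly = TRUE)) {
  toks_reviews <- quanteda::tokens(
    quanteda::corpus(reviews, docid_field = "an_id", text_field = "review")
  )
  print(quanteda::docnames(toks_reviews))
}
```

Checking names(df) on the data frame read from the csv is a quick way to see which column name to pass as text_field.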
Data:
df <- structure(list(an_id = c("doc_1", "doc_2", "doc_3", "doc_4"),
text = c("here", "This dataset contains movie reviews along with their associated binary sentiment polarity labels. It is intended to serve as a benchmark for sentiment classification. This document outlines how the dataset was gathered, and how to use the files provided.",
"The core dataset contains 50,000 reviews split evenly into 25k train and 25k test sets. The overall distribution of labels is balanced (25k pos and 25k neg). We also include an additional 50,000 unlabeled documents for unsupervised learning.",
"There are two top-level directories [train/, test/] corresponding to the training and test sets. Each contains [pos/, neg/] directories for the reviews with binary labels positive and negative. Within these directories, reviews are stored in text files named following the convention [[id]_[rating].txt] where [id] is a unique id and [rating] is the star rating for that review on a 1-10 scale. For example, the file [test/pos/200_8.txt] is the text for a positive-labeled test set example with unique id 200 and star rating 8/10 from IMDb. The [train/unsup/] directory has 0 for all ratings because the ratings are omitted for this portion of the dataset."
)), row.names = c(NA, -4L), class = "data.frame")
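On the other part of the question, reading the id column as text directly: this is probably not what causes the error, but base R's read.csv() accepts a colClasses argument that sets the type per column. A rough sketch, where the temporary file and the df_chr name are stand-ins for the real csv and data frame:

```r
# Write a small example file (a temporary file stands in for the real csv)
path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(an_id = c(4840, 41981), text = c("here", "The core dataset")),
  path,
  row.names = FALSE
)

# Read an_id as character instead of int; text is read as character too
df_chr <- read.csv(path,
                   colClasses = c(an_id = "character", text = "character"))

# The ids can still be prefixed exactly as in the question
df_chr$an_id <- paste0("doc_", df_chr$an_id)
str(df_chr)
```

With the ids already read as character, the paste0() step is only needed if the "doc_" prefix is wanted in the document names.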