When I use unnest_tokens on a list I enter manually, the output includes the row number each word came from.
library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)
# test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")
# break text into single words and record which row each came from
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
The results look like this, which is what I want.
row_name word
<int> <chr>
1 1 furlough
2 2 work
3 2 more
4 2 for
5 2 less
6 2 pai
7 3 total
8 3 burnout
9 3 and
10 3 exhaust
But when I try to read in the real responses from a csv file:
#Import data
text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
and otherwise use the same code:
# break text into single words and record which row each came from
text_df <- tibble(text = text)
tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))
I get the entire token list assigned to row 1 and then again assigned to row 2 and so on.
row_name word
<int> <chr>
1 1 c
2 1 furlough
3 1 work
4 1 more
5 1 for
6 1 less
7 1 pai
8 1 total
9 1 burnout
10 1 and
Or, if I move mutate(row_name = row_number()) to after the unnest_tokens call, I get a separate row number for each token.
word row_name
<chr> <int>
1 c 1
2 furlough 2
3 work 3
4 more 4
5 for 5
6 less 6
7 pai 7
8 total 8
9 burnout 9
10 and 10
What am I missing?
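I guess the difference is that if you import the text using text <- read.csv("TextSample.csv", stringsAsFactors=FALSE)
, text is a data frame, while if you enter it manually it is a character vector. With a data frame, tibble(text = text) puts the whole data frame into one column, and mutate_all(as.character) then turns it into a single deparsed string (which would explain the stray "c" token) that is repeated on every row.
If you alter the code to: text_df <- tibble(text = text$col_name)
to select the column from the data frame (which is a vector) in the csv case, I think you should get the same result as before. Here col_name stands for whatever the column is actually called in your csv.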
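To make the difference concrete, here is a minimal sketch of the csv case. It writes a small stand-in file to a temporary path instead of using the real TextSample.csv, and the column name response is an assumption, so substitute the actual header from your file.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Stand-in for TextSample.csv; the column name "response" is an assumption.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(response = c("furloughs",
                          "Working MORE for less pay",
                          "total burnout and exhaustion")),
  csv_path,
  row.names = FALSE
)

responses <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.csv returns a data frame; only the column itself is a character vector.
is.data.frame(responses)
is.character(responses$response)

# as.character() on the whole data frame gives one deparsed string per column,
# which is where the unexpected "c" token comes from.
as.character(responses)

# Passing the column (a vector) keeps one tibble row per response.
tidy_text <- tibble(text = responses$response) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

tidy_text
```

Running str(text) right after read.csv is a quick way to check which column name to use.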