
keeping document number in tidytext


When I run unnest_tokens() on a character vector that I enter manually, the output includes the row number each word came from.

library(dplyr)
library(tidytext)
library(tidyr)
library(NLP)
library(tm)
library(SnowballC)
library(widyr)
library(textstem)


# test data
text <- c("furloughs", "Working MORE for less pay", "total burnout and exhaustion")

# break the text into single words and record which row each word came from
text_df <- tibble(text = text)

tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

The results look like this, which is what I want.

   row_name word    
      <int> <chr>   
 1        1 furlough
 2        2 work    
 3        2 more    
 4        2 for     
 5        2 less    
 6        2 pai     
 7        3 total   
 8        3 burnout 
 9        3 and     
10        3 exhaust

But when I read in the real responses from a CSV file instead:

# import data
text <- read.csv("TextSample.csv", stringsAsFactors = FALSE)

and otherwise use the same code:

# break the text into single words and record which row each word came from
text_df <- tibble(text = text)

tidy_text <- text_df %>%
  mutate_all(as.character) %>%
  mutate(row_name = row_number()) %>%
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

I get the entire token list assigned to row 1 and then again assigned to row 2 and so on.

   row_name word    
      <int> <chr>   
 1        1 c       
 2        1 furlough
 3        1 work    
 4        1 more    
 5        1 for     
 6        1 less    
 7        1 pai     
 8        1 total   
 9        1 burnout 
10        1 and   

Or, if I move the mutate(row_name = row_number()) step to after the unnest_tokens() call, I get one row number per token rather than per response.

   word     row_name
   <chr>       <int>
 1 c               1
 2 furlough        2
 3 work            3
 4 more            4
 5 for             5
 6 less            6
 7 pai             7
 8 total           8
 9 burnout         9
10 and            10

What am I missing?


Solution

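  • When you import the text with text <- read.csv("TextSample.csv", stringsAsFactors=FALSE), text is a data frame, whereas when you enter it manually it is a character vector. With a data frame, tibble(text = text) stores the whole data frame in one column, and as.character() then deparses it into a single string of the form c("furloughs", ...). That would explain both the stray "c" token and why every row contains the full token list.

    If you change the code to text_df <- tibble(text = text$col_name), where col_name is the name of the column in your csv that holds the responses, you pass a vector again, and I think you should get the same result as before.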
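
    To make the difference concrete, here is a minimal sketch of the csv case. It assumes the responses sit in a column called response (a hypothetical name; substitute whatever your file uses) and writes the three test sentences to a temporary file in place of TextSample.csv so that it runs on its own.

```r
library(dplyr)
library(tidytext)
library(SnowballC)

# Stand-in for TextSample.csv: the three test responses in a temporary file.
# The column name "response" is an assumption made for this example.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(response = c("furloughs",
                          "Working MORE for less pay",
                          "total burnout and exhaustion")),
  csv_path,
  row.names = FALSE
)

text <- read.csv(csv_path, stringsAsFactors = FALSE)

# read.csv() returns a data frame, not a character vector
is_df <- is.data.frame(text)

# as.character() on a data frame deparses each column into ONE string,
# which is where the stray "c" token comes from
deparsed <- as.character(text)

# Fix: pull the column out as a vector before building the tibble
tidy_text <- tibble(text = text$response) %>%
  mutate(row_name = row_number()) %>%   # number the responses before tokenising
  unnest_tokens(word, text) %>%
  mutate(word = wordStem(word))

print(tidy_text)
```

    Checking class(text) or str(text) right after read.csv() is a quick way to see which of the two cases you are in. The mutate_all(as.character) step is left out here because the column is already character when stringsAsFactors = FALSE.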