Join tokens back to sentence


I am doing some text analysis on free-text data with tidytext. Consider two sample sentences:

"The quick brown fox jumps over the lazy dog"
"I love books"

My token approach using tidytext:

unigrams <- tweet_text %>% 
  unnest_tokens(output = word, input = txt) %>%
  anti_join(stop_words, by = "word")

Results in the following:

The
quick
brown
fox
jumps
over 
the
lazy
dog

I now need to join every unigram back to its original sentence:

"The quick brown fox jumps over the lazy dog" | The
"The quick brown fox jumps over the lazy dog" | quick
"The quick brown fox jumps over the lazy dog" | brown
"The quick brown fox jumps over the lazy dog" | fox
"The quick brown fox jumps over the lazy dog" | jumps 
"The quick brown fox jumps over the lazy dog" | over
"The quick brown fox jumps over the lazy dog" | the
"The quick brown fox jumps over the lazy dog" | lazy 
"The quick brown fox jumps over the lazy dog" | dog
"I love books" | I
"I love books" | love
"I love books" | books


I'm a bit stuck. The solution needs to scale to thousands of sentences. I thought a function like this might be native to tidytext, but I haven't found anything yet.


Solution

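  • What you are looking for is the drop = FALSE argument of unnest_tokens(), which keeps the original input column next to each token instead of dropping it: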

    library(dplyr)
    #> 
    #> Attaching package: 'dplyr'
    #> The following objects are masked from 'package:stats':
    #> 
    #>     filter, lag
    #> The following objects are masked from 'package:base':
    #> 
    #>     intersect, setdiff, setequal, union
    library(tidytext)
    
    tweet_text <- tibble(id = 1:2,
                         text = c("The quick brown fox jumps over the lazy dog",
                                  "I love books"))
    
    tweet_text %>% 
      unnest_tokens(output = word, input = text, drop = FALSE)
    #> # A tibble: 12 x 3
    #>       id text                                        word 
    #>    <int> <chr>                                       <chr>
    #>  1     1 The quick brown fox jumps over the lazy dog the  
    #>  2     1 The quick brown fox jumps over the lazy dog quick
    #>  3     1 The quick brown fox jumps over the lazy dog brown
    #>  4     1 The quick brown fox jumps over the lazy dog fox  
    #>  5     1 The quick brown fox jumps over the lazy dog jumps
    #>  6     1 The quick brown fox jumps over the lazy dog over 
    #>  7     1 The quick brown fox jumps over the lazy dog the  
    #>  8     1 The quick brown fox jumps over the lazy dog lazy 
    #>  9     1 The quick brown fox jumps over the lazy dog dog  
    #> 10     2 I love books                                i    
    #> 11     2 I love books                                love 
    #> 12     2 I love books                                books
    

    Created on 2020-02-22 by the reprex package (v0.3.0)
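
To tie this back to the pipeline in the question, here is a minimal sketch that combines drop = FALSE with the stop-word removal step. It assumes the sentence column is called txt, as in the question (the example above uses text), and that an id column is present; both names are only illustrative. The second pipeline uses to_lower = FALSE, which should keep the original casing ("The", "I") if that matters for the output.

```r
library(dplyr)
library(tidytext)

# Illustrative data; the column name `txt` is taken from the question
tweet_text <- tibble(
  id  = 1:2,
  txt = c("The quick brown fox jumps over the lazy dog",
          "I love books")
)

# drop = FALSE keeps the original sentence in `txt` beside every token.
# Tokens are lower-cased by default, so they can be matched against stop_words.
unigrams <- tweet_text %>%
  unnest_tokens(output = word, input = txt, drop = FALSE) %>%
  anti_join(stop_words, by = "word")

# to_lower = FALSE keeps the casing of the source text.
# Stop-word matching is skipped here because stop_words is lower-case.
cased <- tweet_text %>%
  unnest_tokens(output = word, input = txt, drop = FALSE, to_lower = FALSE)
```

Since unnest_tokens() is vectorised over the rows of the data frame, the same call should work unchanged for thousands of sentences. The sentence text is repeated once per token, so for very long documents it may be cheaper to keep only id and join the text back when it is needed.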