I am doing some text analysis on free-text data with tidytext. Consider these sample sentences:
"The quick brown fox jumps over the lazy dog"
"I love books"
My tokenization approach using tidytext:
unigrams = tweet_text %>%
  unnest_tokens(output = word, input = txt) %>%
  anti_join(stop_words)
This results in the following:
The
quick
brown
fox
jumps
over
the
lazy
dog
I now need to join every unigram back to its original sentence:
"The quick brown fox jumps over the lazy dog" | The
"The quick brown fox jumps over the lazy dog" | quick
"The quick brown fox jumps over the lazy dog" | brown
"The quick brown fox jumps over the lazy dog" | fox
"The quick brown fox jumps over the lazy dog" | jumps
"The quick brown fox jumps over the lazy dog" | over
"The quick brown fox jumps over the lazy dog" | the
"The quick brown fox jumps over the lazy dog" | lazy
"The quick brown fox jumps over the lazy dog" | dog
"I love books" | I
"I love books" | love
"I love books" | books
I'm a bit stuck. The solution needs to scale to thousands of sentences. I thought a function like this might be native to tidytext, but I haven't found anything yet.
What you are looking for is the drop = FALSE argument of unnest_tokens(), which keeps the original input column alongside each token:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidytext)
tweet_text <- tibble(id = 1:2,
                     text = c("The quick brown fox jumps over the lazy dog",
                              "I love books"))
tweet_text %>%
  unnest_tokens(output = word, input = text, drop = FALSE)
#> # A tibble: 12 x 3
#>       id text                                        word
#>    <int> <chr>                                       <chr>
#>  1     1 The quick brown fox jumps over the lazy dog the
#>  2     1 The quick brown fox jumps over the lazy dog quick
#>  3     1 The quick brown fox jumps over the lazy dog brown
#>  4     1 The quick brown fox jumps over the lazy dog fox
#>  5     1 The quick brown fox jumps over the lazy dog jumps
#>  6     1 The quick brown fox jumps over the lazy dog over
#>  7     1 The quick brown fox jumps over the lazy dog the
#>  8     1 The quick brown fox jumps over the lazy dog lazy
#>  9     1 The quick brown fox jumps over the lazy dog dog
#> 10     2 I love books                                i
#> 11     2 I love books                                love
#> 12     2 I love books                                books
Created on 2020-02-22 by the reprex package (v0.3.0)
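If you also want the stop-word removal from the question's pipeline, or the original capitalization shown in the desired output, something along these lines might work. This is only a sketch under a few assumptions: the input column is called text as in the reprex above (the question used txt), and the object names unigrams and cased are just illustrative. Note that stop_words contains lowercase words, so the to_lower = FALSE variant is shown without the anti_join() step.

```r
library(dplyr)
library(tidytext)

# Same toy data as in the reprex above
tweet_text <- tibble(
  id = 1:2,
  text = c("The quick brown fox jumps over the lazy dog",
           "I love books")
)

# Keep the original sentence next to each token, then drop stop words.
# Passing by = "word" makes the join column explicit and avoids the
# "Joining, by = ..." message.
unigrams <- tweet_text %>%
  unnest_tokens(output = word, input = text, drop = FALSE) %>%
  anti_join(stop_words, by = "word")

# Variant: keep the original casing of each token ("The", "I").
# Assumption: you do not need stop-word matching here, since
# stop_words is lowercase and would not match "The" or "I".
cased <- tweet_text %>%
  unnest_tokens(output = word, input = text,
                drop = FALSE, to_lower = FALSE)
```

Since unnest_tokens() simply repeats each input row once per token, this should scale to thousands of sentences without an explicit join back to the source data.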