I'm preprocessing some text data for further analysis. I tokenized the text into single words using unnest_tokens(), but I want to keep certain commonly occurring two-word phrases such as "United States" or "social security" together. How can I do this using tidytext?
tidy_data <- data %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
dput(data[1:6, 1:6])
structure(list(race = c("US House", "US House", "US House", "US House",
"", "US House"), district = c(8L, 3L, 6L, 17L, 2L, 1L), party = c("Republican",
"Republican", "Republican", "Republican", "", "Republican"),
state = c("AZ", "AZ", "KY", "TX", "IL", "NH"), sponsor = c(4,
4, 4, 1, NA, 4), approve = structure(c(1L, 1L, 1L, 4L, NA,
1L), .Label = c("no oral statement of approval, authorization",
"beginning of the spot", "middle of the spot", "end of the spot"
), class = "factor")), row.names = c(NA, 6L), class = "data.frame")
If I were in this situation and only had a short list of two-word phrases to keep in my analysis, I would do some careful replacing before and after tokenization.
First, I would replace each two-word phrase with something that sticks together and does not get broken apart by the tokenization process I'm using, for example "united states" to "united_states".
library(tidyverse)
library(tidytext)
df <- tibble(text = c("I live in the United States",
                      "United we stand, divided we fall",
                      "Information security is important!",
                      "I work at the Social Security Administration"))
df_parsed <- df %>%
  mutate(text = str_to_lower(text),
         text = str_replace_all(text, "united states", "united_states"),
         text = str_replace_all(text, "social security", "social_security"))
df_parsed
#> # A tibble: 4 x 1
#> text
#> <chr>
#> 1 i live in the united_states
#> 2 united we stand, divided we fall
#> 3 information security is important!
#> 4 i work at the social_security administration
Then you can tokenize as usual, and afterward turn the placeholders back into the original two-word phrases, so "united_states" goes back to "united states".
df_parsed %>%
  unnest_tokens(word, text) %>%
  mutate(word = case_when(word == "united_states" ~ "united states",
                          word == "social_security" ~ "social security",
                          TRUE ~ word))
#> # A tibble: 21 x 1
#> word
#> <chr>
#> 1 i
#> 2 live
#> 3 in
#> 4 the
#> 5 united states
#> 6 united
#> 7 we
#> 8 stand
#> 9 divided
#> 10 we
#> # … with 11 more rows
Created on 2019-08-03 by the reprex package (v0.3.0)
If you have a long list of these, writing out each replacement by hand becomes onerous, and it might make sense to look at ways to combine bigram and unigram tokenization instead. You can see an example of that here.
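For a moderately sized list, one possible middle ground is to drive the same replace-tokenize-restore steps from a single vector of phrases, so nothing has to be written out per phrase. This is only a sketch: the names phrases and to_joined are made up for illustration, and it relies on str_replace_all() accepting a named vector whose names are treated as regular-expression patterns and whose values are the replacements.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical list of phrases to protect (lower case, plain text without
# regex metacharacters, since the names below are used as regex patterns)
phrases <- c("united states", "social security")

# Named vector: names are the patterns, values are the underscore-joined forms
to_joined <- setNames(str_replace_all(phrases, " ", "_"), phrases)

df <- tibble(text = c("I live in the United States",
                      "United we stand, divided we fall",
                      "Information security is important!",
                      "I work at the Social Security Administration"))

tokens <- df %>%
  # Join each protected phrase with an underscore before tokenizing
  mutate(text = str_replace_all(str_to_lower(text), to_joined)) %>%
  unnest_tokens(word, text) %>%
  # Restore the space only for tokens that came from the phrase list
  mutate(word = if_else(word %in% unname(to_joined),
                        str_replace_all(word, "_", " "),
                        word))
```

Adding a phrase then only means extending phrases. If some phrases can appear inside longer words, wrapping each pattern in "\\b" word boundaries may be worth considering.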