text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
I want to extract 1-gram tokens for most words, and 2-gram tokens for words such as extremely, no, and not.
For example, the tokens I get should look like this: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad
These are the terms that should appear in the term-document matrix.
Thank you for the help!!
Here is a possible solution (assuming you want to avoid splitting not only after c("extremely", "no", "not"), but also after words similar to them).
The qdapDictionaries package has some dictionaries for this: amplification.words (like "extremely"), negation.words (like "no" and "not"), and more.
Here is an example of how to split on a space except when the space follows a word in a predefined vector (here we define the vector using amplification.words, negation.words, and deamplification.words from qdapDictionaries). You can change the definition of no_split_words if you want to use a more customized list of words.
library(stringr)
library(qdapDictionaries)
text <- c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
# define the list of words after which we don't want to split on a space
no_split_words <- c(amplification.words, negation.words, deamplification.words)
# collapse the words into the form "word1|word2| ... |wordn"
regex_or <- paste(no_split_words, collapse="|")
# split on whitespace unless the preceding word is in no_split_words (negative lookbehind)
# build the pattern with paste0() so that no stray spaces end up inside it
split_regex <- regex(paste0("(?<!", regex_or, ")\\s"))
# perform split
str_split(text, split_regex)
#output
[[1]]
[1] "the" "nurse" "was" "extremely helpful"
[[2]]
[1] "she" "was" "truly a" "gem"
[[3]]
[1] "helping"
[[4]]
[1] "no issue"
[[5]]
[1] "not bad"
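If you only want to protect the three words from the question, a customized list could look roughly like the sketch below. It is my own variation, not the code above: it uses base R's strsplit() with perl = TRUE so it needs no packages, and it adds a \\b word boundary to each lookbehind branch so that a word that merely ends in "no" or "not" (the made-up "piano issue" example) is still split. The names branches and tokens are just illustrative.

```r
text <- c("the nurse was extremely helpful", "she was truly a gem",
          "helping", "no issue", "not bad")

# customized list: only these words keep the word that follows them
no_split_words <- c("extremely", "no", "not")

# one lookbehind branch per word, each anchored at a word boundary,
# giving "\\bextremely|\\bno|\\bnot"
branches <- paste0("\\b", no_split_words, collapse = "|")

# split on whitespace unless it directly follows one of the words
split_regex <- paste0("(?<!", branches, ")\\s")

tokens <- strsplit(text, split_regex, perl = TRUE)
print(tokens)

# hypothetical check: "piano" ends in "no" but is not protected
print(strsplit("piano issue", split_regex, perl = TRUE)[[1]])
```

With this list "truly a" is no longer kept together, because "truly" is not in no_split_words. The match is case-sensitive, so lower-case the text first if that matters for your data.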
tidytext
(assumes above code chunk was already run)
library(tidytext)
library(dplyr)
# tibble() replaces the deprecated data_frame()
doc_df <- tibble(text = text) %>%
  mutate(doc_id = row_number())
# create a document-term matrix (the DocumentTermMatrix class from the tm package)
# value = 1 gives a binary dtm
# value can instead be defined as term frequency, tf-idf, etc. for a non-binary dtm
tm_dtm <- doc_df %>%
  unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>%
  mutate(value = 1) %>%
  cast_dtm(doc_id, tokens, value)
# can coerce to matrix if desired
matrix_dtm <- as.matrix(tm_dtm)
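To get a feel for the difference between a binary matrix and a term-frequency one, here is a small base R sketch that builds both from a list of tokens. It does not use tm or tidytext; the tokens list is typed in by hand from the str_split output shown earlier, and the names tf_matrix and binary_matrix are my own.

```r
# tokens per document, copied from the str_split output shown earlier
tokens <- list(
  c("the", "nurse", "was", "extremely helpful"),
  c("she", "was", "truly a", "gem"),
  "helping",
  "no issue",
  "not bad"
)

# one row per (document, term) occurrence
doc_id <- rep(seq_along(tokens), lengths(tokens))
term <- unlist(tokens)

# term-frequency matrix: rows are documents, columns are terms
tf_matrix <- unclass(table(doc_id = doc_id, term = term))

# binary matrix: 1 if the term occurs in the document at all, else 0
binary_matrix <- (tf_matrix > 0) * 1L

print(tf_matrix)
```

For this small example the two matrices are identical, because no term occurs twice in the same document. They only differ once a document repeats a term.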