Tags: r, tm, stringr, rweka

Text mining in R: extracting 2-grams for only a few terms and 1-grams for the rest


text = c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')

I want to extract 1-gram tokens for most words and 2-gram tokens for words such as extremely, no, and not.

For example, the tokens should come out as: the, nurse, was, extremely helpful, she, truly, gem, helping, no issue, not bad

These are the terms that should appear in the term-document matrix.

Thank you for the help!!


Solution

  • Here is a possible solution. It assumes you want to avoid splitting not only after c("extremely", "no", "not") but also after words similar to them. The qdapDictionaries package provides word lists such as amplification.words (e.g. "extremely") and negation.words (e.g. "no" and "not"), among others.

    Here is an example of how to split on a space except when the space follows a word in a predefined vector (here the vector is built from amplification.words, negation.words, and deamplification.words in qdapDictionaries). Every word in that vector is kept together with the word after it, which is why "truly a" stays joined in the output below. You can change the definition of no_split_words if you want a more customized list of words.

    Performing the split

    library(stringr)
    library(qdapDictionaries)
    
    text <-  c('the nurse was extremely helpful', 'she was truly a gem','helping', 'no issue', 'not bad')
    
    # words after which we don't want to split on a space
    no_split_words <- c(amplification.words, negation.words, deamplification.words)
    # collapse words into the form "word1|word2|...|wordn"
    regex_or       <- paste(no_split_words, collapse="|")
    # split on whitespace unless the preceding text ends with one of no_split_words
    # (paste0 rather than paste, so no stray spaces end up around the first and last word)
    split_regex    <- regex(paste0("(?<!", regex_or, ")\\s"))
    
    # perform split
    str_split(text, split_regex)
    
    #output
    [[1]]
    [1] "the"               "nurse"             "was"               "extremely helpful"
    
    [[2]]
    [1] "she"     "was"     "truly a" "gem"    
    
    [[3]]
    [1] "helping"
    
    [[4]]
    [1] "no issue"
    
    [[5]]
    [1] "not bad"
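
    If only a handful of words matter (as in the question: extremely, no, not), the same lookbehind idea can be sketched with a hand-picked vector. The names keep_with_next, custom_regex and custom_tokens are made up for this sketch, and the \\b word boundary is an addition that is not in the code above; it is meant to stop words that merely end in "no" or "not" (such as "piano") from counting as a match.

    ```r
    library(stringr)

    text <- c("the nurse was extremely helpful", "she was truly a gem",
              "helping", "no issue", "not bad")

    # hypothetical hand-picked list: words that should stay attached to the next word
    keep_with_next <- c("extremely", "no", "not")

    # negative lookbehind: split on whitespace unless it directly follows one of
    # the listed words; \\b requires the listed word to start at a word boundary
    custom_regex <- regex(
      paste0("(?<!\\b(?:", paste(keep_with_next, collapse = "|"), "))\\s")
    )

    custom_tokens <- str_split(text, custom_regex)

    custom_tokens[[1]]
    # "the" "nurse" "was" "extremely helpful"
    custom_tokens[[2]]
    # "she" "was" "truly" "a" "gem"
    ```

    With this shorter list "truly" is split off on its own, which is closer to the tokens asked for in the question; removing a stop word like "a" would still be a separate step.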
    

    Creating a DTM with tidytext

    (Assumes the code chunk above has already been run.)

    library(tidytext)
    library(dplyr)
    
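    # tibble() replaces the deprecated data_frame()
    doc_df <- tibble(text = text) %>% 
      mutate(doc_id = row_number())
    
    # cast_dtm creates a DocumentTermMatrix (the class used by the tm package)
    # value = 1 for every token gives a binary dtm
    # value can instead hold term frequency, tf-idf, etc. for a nonbinary dtm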
    tm_dtm <- doc_df %>% 
      unnest_tokens(tokens, text, token="regex", pattern=split_regex) %>% 
      mutate(value = 1) %>%  
      cast_dtm(doc_id, tokens, value)
    
    # can coerce to matrix if desired
    matrix_dtm <- as.matrix(tm_dtm)
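
    The remark about a nonbinary dtm can be made concrete by counting tokens per document before casting. This is only a sketch under assumptions: the hand-picked word list, the made-up third document, and the names keep_regex, tf_dtm and tf_matrix are illustrative and not part of the answer above, and cast_dtm needs the tm package to be installed.

    ```r
    library(dplyr)
    library(tidytext)

    # first two documents come from the question; the third is made up so that
    # one token occurs more than once within a document
    text <- c("the nurse was extremely helpful", "no issue", "not bad not bad")

    # hypothetical hand-picked list: do not split after extremely / no / not
    keep_regex <- "(?<!\\b(?:extremely|no|not))\\s"

    tf_dtm <- tibble(doc_id = seq_along(text), text = text) %>%
      unnest_tokens(tokens, text, token = "regex", pattern = keep_regex) %>%
      count(doc_id, tokens, name = "value") %>%   # term frequency per document
      cast_dtm(doc_id, tokens, value)

    # dense matrix with one row per document and one column per token
    tf_matrix <- as.matrix(tf_dtm)
    ```

    Here the value column holds raw counts; a tf-idf weighting could be computed on the counted table (for example with tidytext's bind_tf_idf) before casting, if that suits the analysis better.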