Tags: r, pdf, text, text-mining, tm

Using tm() to mine PDFs for two and three word phrases


I'm trying to mine a set of PDFs for specific two- and three-word phrases. I know this question has been asked under various circumstances, but none of the existing answers solve my problem.

This solution partly works; however, the output does not include strings containing more than one word.

I've tried the solutions offered in other threads (here and here, for example, as well as many others). Unfortunately, nothing works.

Also, the qdap library won't load, and I wasted an hour trying to fix that, so that solution won't work either, even though it looks reasonably easy.

library(tm)

data("crude")
crude <- as.VCorpus(crude)
crude <- tm_map(crude, content_transformer(tolower))

my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")

dtm <- DocumentTermMatrix(crude, control=list(dictionary = my_words))

# create a data.frame from the DocumentTermMatrix
df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL)
head(df1)

As you can see, the output returns "contract.prices" instead of "contract prices" so I'm looking for a simple solution to this. File 127 includes the phrase 'contract prices' so the table should record at least one instance of this.

I'm also happy to share my actual data, but I'm not sure how to save a small portion of it (it's gigantic). So for now I'm using a substitute with the 'crude' data.


Solution

  • Here is a way to get what you want using the tm package together with RWeka. You need to create a separate tokenizer function that you plug into the DocumentTermMatrix function. RWeka plays very nicely with tm for this.

    If you don't want to install RWeka because of its Java dependencies, you can use another package such as tidytext or quanteda. If you need speed because of the size of your data, I advise using the quanteda package (example below the tm code). quanteda runs in parallel, and with quanteda_options you can specify how many cores to use (two is the default).

    Note: the unigrams and bigrams in your dictionary overlap. In the example you will see that in text 127, "prices" (3) and "contract prices" (1) double-count the word "prices".
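As a sketch of the parallelism point above (the `threads` option name is assumed from current quanteda versions), raising the core count looks like this:

```r
library(quanteda)

# use more threads than the default of two
quanteda_options(threads = 4)

# check the current setting
quanteda_options("threads")
```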

    library(tm)
    library(RWeka)
    
    data("crude")
    crude <- as.VCorpus(crude)
    crude <- tm_map(crude, content_transformer(tolower))
    
    my_words <- c("contract", "prices", "contract prices", "diamond", "shamrock", "diamond shamrock")
    
    
    # adjust to min = 2 and max = 3 for 2 and 3 word ngrams
    RWeka_tokenizer <- function(x) {
      NGramTokenizer(x, Weka_control(min = 1, max = 2)) 
    }
    
    dtm <- DocumentTermMatrix(crude, control=list(tokenize = RWeka_tokenizer,
                                                  dictionary = my_words))
    
    # create a data.frame from the DocumentTermMatrix
    df1 <- data.frame(docs = dtm$dimnames$Docs, as.matrix(dtm), row.names = NULL, check.names = FALSE)
    

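If the Java dependency is the blocker, a tokenizer built from the NLP package (which tm already depends on) can stand in for RWeka. This is a sketch of the usual pattern, reusing `crude` and `my_words` from above:

```r
library(tm)
library(NLP)

# build unigrams and bigrams without Java, using NLP::ngrams and NLP::words
ngram_tokenizer <- function(x) {
  unlist(lapply(ngrams(words(x), 1:2), paste, collapse = " "),
         use.names = FALSE)
}

dtm2 <- DocumentTermMatrix(crude, control = list(tokenize = ngram_tokenizer,
                                                 dictionary = my_words))
```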
    For speed with a big corpus, quanteda might be better:

    library(quanteda)
    
    corp_crude <- corpus(crude)
    # adjust ngrams to 2:3 for 2 and 3 word ngrams
    toks_crude <- tokens(corp_crude, ngrams = 1:2, concatenator = " ")
    toks_crude <- tokens_keep(toks_crude, pattern = dictionary(list(words = my_words)), valuetype = "fixed")
    dfm_crude <- dfm(toks_crude)
    df1 <- convert(dfm_crude, to = "data.frame")
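Note that in quanteda version 3 and later the `ngrams` argument was dropped from `tokens()`; a sketch of the equivalent against the newer API uses a separate `tokens_ngrams()` step:

```r
library(quanteda)

corp_crude <- corpus(crude)
toks_crude <- tokens(corp_crude)
# adjust n to 2:3 for 2- and 3-word ngrams
toks_crude <- tokens_ngrams(toks_crude, n = 1:2, concatenator = " ")
toks_crude <- tokens_keep(toks_crude,
                          pattern = dictionary(list(words = my_words)),
                          valuetype = "fixed")
df1 <- convert(dfm(toks_crude), to = "data.frame")
```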