Tags: r, tm, n-gram

tm Bigrams workaround still producing unigrams


I am trying to use tm's DocumentTermMatrix function to produce a matrix of bigrams instead of unigrams. I have tried to apply the examples outlined here and here in my function; here are my three attempts:

# Attempt 1: tau::textcnt() as the tokenizer (requires library(tm) and library(tau))
make_dtm = function(main_df, stem=F){
  tokenize_ngrams = function(x, n=2) return(rownames(as.data.frame(unclass(textcnt(x,method="string",n=n)))))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=tokenize_ngrams,
                                                           stopwords=T,
                                                           tolower=T,
                                                           removeNumbers=T,
                                                           removePunctuation=T,
                                                           stemming = stem))
  return(decisions.dtm)
}

# Attempt 2: RWeka::NGramTokenizer() (requires library(tm) and library(RWeka))
make_dtm = function(main_df, stem=F){
  BigramTokenizer = function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                           stopwords=T,
                                                           tolower=T,
                                                           removeNumbers=T,
                                                           removePunctuation=T,
                                                           stemming = stem))
  return(decisions.dtm)
}

# Attempt 3: NLP::ngrams() (NLP is attached automatically with library(tm))
make_dtm = function(main_df, stem=F){
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
  decisions = Corpus(VectorSource(main_df$CaseTranscriptText))
  decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenize=BigramTokenizer,
                                                           stopwords=T,
                                                           tolower=T,
                                                           removeNumbers=T,
                                                           removePunctuation=T,
                                                           stemming = stem))
  return(decisions.dtm)
}

Unfortunately, each of these three versions of the function produces exactly the same output: a DTM with unigrams rather than bigrams:

[screenshot: the resulting DocumentTermMatrix contains only single-word terms]

For your convenience, here is a subset of the data that I am working with:

x = data.frame("CaseName" = c("Attorney General's Reference (No.23 of 2011)", "Attorney General's Reference (No.31 of 2016)", "Joseph Hill & Co Solicitors, Re"),
               "CaseID"= c("[2011]EWCACrim1496", "[2016]EWCACrim1386", "[2013]EWCACrim775"),
               "CaseTranscriptText" = c("sanchez 2011 02187 6 appeal criminal division 8 2011 2011 ewca crim 14962011 wl 844075 wales wednesday 8 2011 attorney general reference 23 2011 36 criminal act 1988 representation qc general qc appeared behalf attorney general", 
                                        "attorney general reference 31 2016 201601021 2 appeal criminal division 20 2016 2016 ewca crim 13862016 wl 05335394 dbe honour qc sitting cacd wednesday 20 th 2016 reference attorney general 36 criminal act 1988 representation",
                                        "matter wasted costs against company solicitors 201205544 5 appeal criminal division 21 2013 2013 ewca crim 7752013 wl 2110641 date 21 05 2013 appeal honour pawlak 20111354 hearing date 13 th 2013 representation toole respondent qc appellants"))

Solution

  • There are a few issues with your code. I'm focusing only on the last function you created, as I don't use the tau or RWeka packages.

    1. To use the tokenizer you need to specify tokenizer = ..., not tokenize = ...

    2. Instead of Corpus you need VCorpus: Corpus() on a VectorSource returns a SimpleCorpus, which supports only a fixed set of built-in transformations and silently ignores custom tokenizers.

    3. After adjusting this in your function make_dtm, I was still not happy with the results: not everything specified in the control options is processed correctly. I created a second function, make_dtm_adjusted, so you can see the differences between the two.
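
Points 1 and 2 can be checked in isolation; a minimal sketch with a hypothetical two-document corpus, assuming the tm package is loaded:

```r
library(tm)  # also attaches NLP (ngrams(), words())

BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                                      use.names = FALSE)
docs <- c("the cat sat on the mat", "the dog chased the cat")  # hypothetical sample

class(Corpus(VectorSource(docs)))[1]    # "SimpleCorpus" - custom tokenizers ignored
class(VCorpus(VectorSource(docs)))[1]   # "VCorpus"

dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)),
                          control = list(tokenizer = BigramTokenizer))
head(Terms(dtm))                        # two-word terms such as "cat sat"
```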

    # OP's function adjusted to make it work
    make_dtm = function(main_df, stem=F){
      BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
      decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))
      decisions.dtm = DocumentTermMatrix(decisions, control = list(tokenizer=BigramTokenizer,
                                                               stopwords=T,
                                                               tolower=T,
                                                               removeNumbers=T,
                                                               removePunctuation=T,
                                                               stemming = stem))
      return(decisions.dtm)
    }
    
    # improved function
    make_dtm_adjusted = function(main_df, stem=F){
      BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
      decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))
    
      decisions <- tm_map(decisions, content_transformer(tolower))
      decisions <- tm_map(decisions, removeNumbers)
      decisions <- tm_map(decisions, removePunctuation)
      # specifying your own stopword list is better as you can use stopwords("smart")
      # or your own list
      decisions <- tm_map(decisions, removeWords, stopwords("english")) 
      decisions <- tm_map(decisions, stripWhitespace)
    
      decisions.dtm = DocumentTermMatrix(decisions, control = list(stemming = stem,
                                                                   tokenizer=BigramTokenizer))
      return(decisions.dtm)
    }
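
To sanity-check the adjusted pipeline end to end, here is a self-contained run on a hypothetical two-row data frame (not the question's data); every resulting term should contain a space, i.e. be a bigram:

```r
library(tm)  # also attaches NLP (ngrams(), words())

# make_dtm_adjusted() as defined above
make_dtm_adjusted = function(main_df, stem=F){
  BigramTokenizer = function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "),
                                       use.names = FALSE)
  decisions = VCorpus(VectorSource(main_df$CaseTranscriptText))
  decisions <- tm_map(decisions, content_transformer(tolower))
  decisions <- tm_map(decisions, removeNumbers)
  decisions <- tm_map(decisions, removePunctuation)
  decisions <- tm_map(decisions, removeWords, stopwords("english"))
  decisions <- tm_map(decisions, stripWhitespace)
  DocumentTermMatrix(decisions, control = list(stemming = stem,
                                               tokenizer = BigramTokenizer))
}

# hypothetical two-row stand-in for the question's data frame x
x <- data.frame(CaseTranscriptText = c("appeal criminal division attorney general reference",
                                       "wasted costs order against the solicitors firm"),
                stringsAsFactors = FALSE)

dtm <- make_dtm_adjusted(x)
all(grepl(" ", Terms(dtm)))  # TRUE - every term is now a two-word phrase
```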