
R: dtm with ngram tokenizer plus dictionary broken in Ubuntu?


I am creating a document term matrix, with a dictionary and ngram tokenization. It works on my Windows 7 laptop, but not on a similarly configured Ubuntu 14.04.2 server. UPDATE: It also works on a Centos server.

library(tm)
library(RWeka)
library(SnowballC)

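# Tokenize into unigrams and bigrams; fall back to single words if nothing is returned
newBigramTokenizer <- function(x) {
  tokenizer1 <- RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
  if (length(tokenizer1) != 0L) {
    return(tokenizer1)
  } else {
    return(RWeka::WordTokenizer(x))
  }
}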

textvect <- c("this is a story about a girl", 
              "this is a story about a boy", 
              "a boy and a girl went to the store",
              "a store is a place to buy things",
              "you can also buy things from a boy or a girl",
              "the word store can also be a verb meaning to position something for later use")

textvect <- iconv(textvect, to = "utf-8")
textsource <- VectorSource(textvect)
textcorp <- Corpus(textsource)

textdict <- c("boy", "girl", "store", "story about")
textdict <- iconv(textdict, to = "utf-8")

# OK
dtm <- DocumentTermMatrix(textcorp, control=list(dictionary=textdict))

# OK on Windows laptop
# freezes or generates error on Ubuntu server
dtm <- DocumentTermMatrix(textcorp, control=list(tokenize=newBigramTokenizer,
                                             dictionary=textdict))

Error from the Ubuntu server (at the last line in the source example):

/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  'i, j' invalid
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
  scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms),  :
  NAs introduced by coercion

I have already tried some of the suggestions in "Twitter Data Analysis - Error in Term Document Matrix" and "Error in simple_triplet_matrix -- unable to use RWeka to count Phrases".

I had thought my problem could be attributed to one of the following, but the script now runs on a Centos server with the same locales and JVM as the problematic Ubuntu server:

  • the locales
  • the minor difference in JVMs
  • the parallel library? mclapply is mentioned in the error message, and parallel is listed in the session info (though for all systems).
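
One way to test the parallel hypothesis is to force mclapply onto a single core before re-running the failing call. The sketch below uses only base R's parallel package. It assumes (not confirmed here) that tm calls mclapply without an explicit core count, so that the mc.cores option applies.

```r
library(parallel)

# Remember the current setting so it can be restored afterwards.
old_opts <- options(mc.cores = 1L)

# With a single core, mclapply() takes its sequential path
# and does not fork worker processes.
squares <- mclapply(1:3, function(i) i^2)

# Restore the previous mc.cores setting.
options(old_opts)
```

If the DocumentTermMatrix call with tokenize=newBigramTokenizer succeeds on the Ubuntu server while mc.cores is set to 1, that would point at the forked workers rather than at the locales or the JVM version.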

Here are the three environments (Windows laptop, Ubuntu server, Centos server):

R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit)

PS C:\> java -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)

locale: 
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RWeka_0.4-23 tm_0.6       NLP_0.1-5   

loaded via a namespace (and not attached):
[1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32       
[6] tools_3.1.2         

R version 3.1.2 (2014-10-31) Platform: x86_64-pc-linux-gnu (64-bit)

$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)

locale:
[1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
[4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
[7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] RWeka_0.4-23 tm_0.6       NLP_0.1-5   

loaded via a namespace (and not attached):
[1] grid_3.1.2         parallel_3.1.2     rJava_0.9-6        RWekajars_3.7.11-1 slam_0.1-32       
[6] tools_3.1.2     

R version 3.2.0 (2015-04-16) Platform: x86_64-redhat-linux-gnu (64-bit) Running under: CentOS Linux 7 (Core)

$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)


locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8           LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8
 [9] LC_ADDRESS=en_US.UTF-8        LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] RWeka_0.4-24 tm_0.6-2     NLP_0.1-8

loaded via a namespace (and not attached):
[1] parallel_3.2.0     tools_3.2.0        slam_0.1-32        grid_3.2.0
[5] rJava_0.9-6        RWekajars_3.7.12-1

Solution

  • If you prefer something simpler but no less flexible or powerful, how about trying out the quanteda package? It can make quick work of your dictionary and bigram task in three lines:

    # or: devtools::install_github("kbenoit/quanteda")
    require(quanteda)
    
    # use dictionary() to construct dictionary from named list
    textdict <- dictionary(list(mydict = c("boy", "girl", "store", "story about")))
    
    # convert to document-feature matrix, with 1grams + 2grams, apply dictionary
    dfm(textvect, dictionary = textdict, ngrams = 1:2, concatenator = " ")
    ## Document-feature matrix of: 6 documents, 1 feature.
    ## 6 x 1 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    mydict
    ##   text1      2
    ##   text2      2
    ##   text3      3
    ##   text4      1
    ##   text5      2
    ##   text6      1
    
    # alternative is to consider the dictionary as a thesaurus of synonyms, 
    # not exclusive in feature selection as is a dictionary 
    dfm.all <- dfm(textvect, thesaurus = textdict,
                   ngrams = 1:2, concatenator = " ", verbose = FALSE)
    topfeatures(dfm.all)
    ##      a  MYDICT   a boy  a girl      is    is a      to a story   about about a 
    ##     11      11       3       3       3       3       3       2       2       2 
    
    dfm_sort(dfm.all)[1:6, 1:12]
    ## Document-feature matrix of: 6 documents, 12 features.
    ## 6 x 12 sparse Matrix of class "dfmSparse"
    ##        features
    ## docs    a MYDICT a boy a girl is is a to a story about about a also buy
    ##   text1 2      2     0      1  1    1  0       1     1       1    0   0
    ##   text2 2      2     1      0  1    1  0       1     1       1    0   0
    ##   text3 2      3     1      1  0    0  1       0     0       0    0   0
    ##   text4 2      1     0      0  1    1  1       0     0       0    0   1
    ##   text5 2      2     1      1  0    0  0       0     0       0    1   1
    ##   text6 1      1     0      0  0    0  1       0     0       0    1   0
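
As a cross-check on the dictionary counts above, here is a minimal base-R sketch that needs neither Java nor any package. It is not how quanteda or RWeka work internally. The helper ngrams_1_2 is a hypothetical name introduced for illustration, and it assumes lower-case text with words separated by spaces.

```r
# Hypothetical helper: return the unigrams plus bigrams of one sentence.
ngrams_1_2 <- function(text) {
  words <- strsplit(text, " +")[[1]]
  if (length(words) < 2L) return(words)
  # Pair each word with its successor to form the bigrams.
  bigrams <- paste(head(words, -1L), tail(words, -1L))
  c(words, bigrams)
}

textvect <- c("this is a story about a girl",
              "this is a story about a boy",
              "a boy and a girl went to the store",
              "a store is a place to buy things",
              "you can also buy things from a boy or a girl",
              "the word store can also be a verb meaning to position something for later use")

textdict <- c("boy", "girl", "store", "story about")

# Count, per document, how many 1-grams and 2-grams appear in the dictionary.
counts <- vapply(textvect,
                 function(doc) sum(ngrams_1_2(doc) %in% textdict),
                 integer(1), USE.NAMES = FALSE)
names(counts) <- paste0("text", seq_along(counts))
counts
## text1 text2 text3 text4 text5 text6
##     2     2     3     1     2     1
```

These totals match the mydict column in the first dfm output, which suggests the dictionary is being applied to both single words and the two-word phrase "story about".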