I am creating a document-term matrix with a dictionary and n-gram tokenization. It works on my Windows 7 laptop, but not on a similarly configured Ubuntu 14.04.2 server. UPDATE: It also works on a CentOS server.
library(tm)
library(RWeka)
library(SnowballC)
newBigramTokenizer = function(x) {
  tokenizer1 = RWeka::NGramTokenizer(x, RWeka::Weka_control(min = 1, max = 2))
  if (length(tokenizer1) != 0L) {
    return(tokenizer1)
  } else {
    # fall back to plain word tokens when the n-gram tokenizer returns nothing
    return(RWeka::WordTokenizer(x))
  }
}
textvect <- c("this is a story about a girl",
"this is a story about a boy",
"a boy and a girl went to the store",
"a store is a place to buy things",
"you can also buy things from a boy or a girl",
"the word store can also be a verb meaning to position something for later use")
textvect <- iconv(textvect, to = "utf-8")
textsource <- VectorSource(textvect)
textcorp <- Corpus(textsource)
textdict <- c("boy", "girl", "store", "story about")
textdict <- iconv(textdict, to = "utf-8")
# OK
dtm <- DocumentTermMatrix(textcorp, control=list(dictionary=textdict))
# OK on Windows laptop
# freezes or generates error on Ubuntu server
dtm <- DocumentTermMatrix(textcorp, control=list(tokenize=newBigramTokenizer,
dictionary=textdict))
Error from the Ubuntu server (at the last line in the source example):
/usr/lib/jvm/java-7-openjdk-amd64/jre/lib/rt.jar: invalid LOC header (bad signature)
Error in simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
'i, j' invalid
In addition: Warning messages:
1: In mclapply(unname(content(x)), termFreq, control) :
scheduled core 1 encountered error in user code, all values of the job will be affected
2: In simple_triplet_matrix(i = i, j = j, v = as.numeric(v), nrow = length(allTerms), :
NAs introduced by coercion
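The mclapply warning suggests the failure happens inside tm's parallel term-frequency computation: on Unix, tm dispatches termFreq() over forked worker processes, and a forked child cannot safely reuse the parent process's JVM, which would also explain why the same code succeeds on Windows (where mclapply runs serially). A sketch of a possible workaround that forces single-core evaluation (not verified on this server):

```r
library(tm)
library(RWeka)

# parallel::mclapply() honors the mc.cores option; setting it to 1 keeps
# the Java-backed tokenizer in the main process instead of forked children
options(mc.cores = 1)

dtm <- DocumentTermMatrix(textcorp,
                          control = list(tokenize = newBigramTokenizer,
                                         dictionary = textdict))
```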
I have already tried some of the suggestions in "Twitter Data Analysis - Error in Term Document Matrix" and "Error in simple_triplet_matrix -- unable to use RWeka to count Phrases". I had thought my problem could be attributed to one of the causes discussed there, but the script now runs on a CentOS server with the same locales and JVM version as the problematic Ubuntu server.
Here are the three environments:
Windows 7 laptop:

R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit)
PS C:\> java -version
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8
java version "1.7.0_72"
Java(TM) SE Runtime Environment (build 1.7.0_72-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.72-b04, mixed mode)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-23 tm_0.6 NLP_0.1-5
loaded via a namespace (and not attached):
[1] grid_3.1.2 parallel_3.1.2 rJava_0.9-6 RWekajars_3.7.11-1 slam_0.1-32
[6] tools_3.1.2
Ubuntu 14.04.2 server:

R version 3.1.2 (2014-10-31) Platform: x86_64-pc-linux-gnu (64-bit)
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (IcedTea 2.5.5) (7u79-2.5.5-0ubuntu0.14.04.2)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8
[4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8 LC_ADDRESS=en_US.UTF-8
[10] LC_TELEPHONE=en_US.UTF-8 LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-23 tm_0.6 NLP_0.1-5
loaded via a namespace (and not attached):
[1] grid_3.1.2 parallel_3.1.2 rJava_0.9-6 RWekajars_3.7.11-1 slam_0.1-32
[6] tools_3.1.2
CentOS server:

R version 3.2.0 (2015-04-16) Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
$ java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.1.el7_1-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=en_US.UTF-8
[9] LC_ADDRESS=en_US.UTF-8 LC_TELEPHONE=en_US.UTF-8
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RWeka_0.4-24 tm_0.6-2 NLP_0.1-8
loaded via a namespace (and not attached):
[1] parallel_3.2.0 tools_3.2.0 slam_0.1-32 grid_3.2.0
[5] rJava_0.9-6 RWekajars_3.7.12-1
If you prefer something simpler but no less flexible or powerful, how about trying out the quanteda package? It can make quick work of your dictionary and bigram task in three lines:
install.packages("quanteda")
# or the development version: devtools::install_github("kbenoit/quanteda")
require(quanteda)
# use dictionary() to construct dictionary from named list
textdict <- dictionary(list(mydict = c("boy", "girl", "store", "story about")))
# convert to a document-feature matrix with 1-grams + 2-grams, then apply the dictionary
dfm(textvect, dictionary = textdict, ngrams = 1:2, concatenator = " ")
## Document-feature matrix of: 6 documents, 1 feature.
## 6 x 1 sparse Matrix of class "dfmSparse"
## features
## docs mydict
## text1 2
## text2 2
## text3 3
## text4 1
## text5 2
## text6 1
# alternatively, treat the dictionary as a thesaurus of synonyms: matching
# features are grouped under the dictionary key, but non-matching features
# are kept rather than dropped as they are with dictionary
dfm.all <- dfm(textvect, thesaurus = textdict,
ngrams = 1:2, concatenator = " ", verbose = FALSE)
topfeatures(dfm.all)
## a MYDICT a boy a girl is is a to a story about about a
## 11 11 3 3 3 3 3 2 2 2
dfm_sort(dfm.all)[1:6, 1:12]
## Document-feature matrix of: 6 documents, 12 features.
## 6 x 12 sparse Matrix of class "dfmSparse"
## features
## docs a MYDICT a boy a girl is is a to a story about about a also buy
## text1 2 2 0 1 1 1 0 1 1 1 0 0
## text2 2 2 1 0 1 1 0 1 1 1 0 0
## text3 2 3 1 1 0 0 1 0 0 0 0 0
## text4 2 1 0 0 1 1 1 0 0 0 0 1
## text5 2 2 1 1 0 0 0 0 0 0 1 1
## text6 1 1 0 0 0 0 1 0 0 0 1 0
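If downstream code still expects a tm object, the dfm can be converted back; this sketch assumes your installed quanteda version provides convert() with a "tm" target:

```r
require(quanteda)

mydfm <- dfm(textvect, dictionary = textdict, ngrams = 1:2, concatenator = " ")
# convert() is assumed available here; it returns a tm::DocumentTermMatrix
# that the rest of a tm-based pipeline can consume
dtm2 <- convert(mydfm, to = "tm")
```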