I came across the text2vec package today and it's exactly what I need for a particular problem. However, I haven't been able to figure out how to export a dtm created with text2vec to some kind of output file. My ultimate goal is to generate features in R using text2vec and import the resulting matrices into H2O for further modeling. H2O can read either CSV or SVMLight formats.
The first one I've created is a 987753 x 8806 sparse Matrix of class "dgCMatrix" with 3625049 entries, so it's pretty big. It's too big to convert with as.matrix() and write out as CSV. I thought I might be able to write it out in SVMLight format instead, but I haven't been able to find a library that works. Does anyone have other options for getting this output into a file that I can read into H2O?
There are several packages that can do this. Take a look at https://github.com/Laurae2/sparsity, which IMHO is the most promising:
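library(text2vec)
library(magrittr)  # provides %>% in case text2vec does not re-export it
library(sparsity)  # GitHub package: Laurae2/sparsity
data("movie_review")
N = 5000
# lowercase and tokenize the first N reviews
tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
it = itoken(tokens, progressbar = TRUE)
# build a sparse document-term matrix using feature hashing
dtm = create_dtm(it, hash_vectorizer())
# one label per row of dtm, written together with the sparse features
write.svmlight(dtm, labelVector = movie_review$sentiment[1:N], file = "dtm.svmlight")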
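If installing a GitHub package is not an option, the SVMLight format is simple enough to write by hand with only the Matrix package. Below is a minimal sketch, not the sparsity implementation: `write_svmlight_sketch` is a hypothetical helper name, and it assumes a numeric sparse matrix (such as a "dgCMatrix") with one label per row. Each output line is `label index:value index:value ...` with 1-based feature indices, and rows with no non-zero entries are written as the label alone.

```r
library(Matrix)

# Hypothetical helper (not part of text2vec or sparsity): writes a numeric
# sparse matrix plus one label per row as SVMLight text.
write_svmlight_sketch <- function(m, labels, file) {
  stopifnot(nrow(m) == length(labels))
  # summary() on a sparse matrix gives its non-zero triplets (i, j, x), 1-based
  trip <- summary(as(m, "CsparseMatrix"))
  trip <- trip[order(trip$i, trip$j), ]
  feats <- paste0(trip$j, ":", trip$x)
  # collapse the "index:value" pairs of each row; rows without entries become NA
  rows <- factor(trip$i, levels = seq_len(nrow(m)))
  per_row <- tapply(feats, rows, paste, collapse = " ")
  per_row[is.na(per_row)] <- ""
  writeLines(trimws(paste(labels, per_row)), file)
}

# Tiny example: 3 documents, 3 features, the second document is empty
m <- sparseMatrix(i = c(1, 1, 3), j = c(1, 3, 2), x = c(2, 1, 5), dims = c(3, 3))
out_file <- tempfile(fileext = ".svmlight")
write_svmlight_sketch(m, labels = c(1, 0, 1), file = out_file)
readLines(out_file)
# "1 1:2 3:1" "0" "1 2:5"
```

This builds every line in memory before writing, which should be manageable for a few million non-zero entries but is likely slower than a dedicated writer on a matrix of the size in the question.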