Tags: r, sparse-matrix, svmlight, text2vec

Write a text2vec dtm to a file (csv or svmlight)


I came across the text2vec package today and it's exactly what I need for a particular problem. However, I haven't been able to figure out how to export a dtm created with text2vec to some kind of output file. My ultimate goal is to generate features in R using text2vec and import the resulting matrices into H2O for further modeling. H2O can read either CSV or SVMLight formats.

The first one I've created is a 987753 x 8806 sparse Matrix of class "dgCMatrix", with 3625049 entries, so it's pretty big. It's not possible to use as.matrix() to write it out to CSV, since the dense version is too big. I thought I could easily write it out in SVMLight format, but I haven't been able to find a library that works. Does anyone have other options for getting this output into a file that I can read into H2O?


Solution

  • There are several packages that can do that. Take a look at https://github.com/Laurae2/sparsity, which is IMHO the most promising:

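    library(text2vec)
    library(magrittr)  # for %>%, in case it is not already attached
    library(sparsity)  # from https://github.com/Laurae2/sparsity
    data("movie_review")
    N = 5000
    # lowercase and tokenize the first N reviews
    tokens = movie_review$review[1:N] %>% tolower %>% word_tokenizer
    it = itoken(tokens, progressbar = TRUE)
    # build the document-term matrix with feature hashing (no vocabulary needed)
    dtm = create_dtm(it, hash_vectorizer())
    # the labels must line up with the rows of dtm, so subset them the same way
    write.svmlight(dtm, labelVector = movie_review$sentiment[1:N], file = "dtm.svmlight")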
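
If installing an extra package is not an option, the same kind of file can be produced with the Matrix package alone. The following is only a rough sketch, not what sparsity does internally: `write_svmlight_sketch` is a hypothetical helper name, it assumes a numeric sparse matrix (such as a dgCMatrix) and 1-based feature indices, and it has not been tuned for a matrix with close to a million rows. It is worth checking on a small sample that H2O parses the result the way you expect.

```r
library(Matrix)

# Hypothetical helper: writes one "label index:value ..." line per matrix row.
write_svmlight_sketch <- function(m, labels, file) {
  stopifnot(nrow(m) == length(labels))
  # Triplet form exposes the (row, column, value) of every stored entry;
  # the i and j slots are 0-based, so shift them to 1-based indices.
  tm <- as(m, "TsparseMatrix")
  i <- tm@i + 1L
  j <- tm@j + 1L
  x <- tm@x
  # Sort by row, then by column, so features appear in ascending order.
  ord <- order(i, j)
  feats <- paste0(j[ord], ":", x[ord])
  # Group the "index:value" strings by row; rows without entries stay empty.
  by_row <- split(feats, factor(i[ord], levels = seq_len(nrow(m))))
  per_row <- unname(vapply(by_row, paste, character(1), collapse = " "))
  # trimws() drops the trailing space left on rows that have no features.
  writeLines(trimws(paste(labels, per_row)), file)
}

# Small usage example on a 3 x 4 matrix with an empty second row.
m <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1), x = c(1, 2.5, 3), dims = c(3, 4))
out <- tempfile(fileext = ".svmlight")
write_svmlight_sketch(m, labels = c(1, 0, 1), file = out)
cat(readLines(out), sep = "\n")
```

With the dtm from the answer above, the call would be `write_svmlight_sketch(dtm, movie_review$sentiment[1:N], "dtm.svmlight")`. The values are formatted by R's default number-to-string conversion, which is fine for counts but may not be what you want for weights that need a fixed precision.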