
The compatibility between text2vec and RHadoop


At present, we use text2vec to process a large dataset on a single AWS EC2 instance. The text data will keep growing, so we may move to an RHadoop (MapReduce) architecture, but we don't know whether text2vec is compatible with RHadoop (MapReduce).


Solution

  • The short answer is yes: if you really want to, you can make anything work with RHadoop. But I'm pretty sure the effort will be substantial, and you probably won't be satisfied with the results.

    Coming back to the real problem: it is worth trying text2vec version 0.5 (released last week), which consumes even less RAM than before. You can also easily process data in chunks and in parallel. Check this vignette for an example.

    Another thing is that for basic tasks like classification you usually don't need all the data in RAM. For example, you can check another package of mine, FTRL, which fits logistic regression (with L1/L2/elastic-net penalty) incrementally with SGD.

    It would be great to have a report from you on GitHub about the memory problem (which actually comes from the Matrix package).

    PS: tree methods and ensembles are usually not a good fit for sparse, high-dimensional data.
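
To make the point about not needing all the data in RAM more concrete, here is a minimal, language-agnostic sketch of the idea: stream labelled documents in small chunks and update a logistic regression one example at a time with an FTRL-Proximal style rule. It is written in plain Python only so that it is self-contained; it is not the API of text2vec or of the FTRL package. The names `FTRLLogistic`, `chunks`, and `DOCS`, the toy data, and the hyperparameter defaults are all assumptions made for illustration.

```python
import math
from collections import defaultdict
from itertools import islice


class FTRLLogistic:
    """Hypothetical sketch: logistic regression updated per example (FTRL-Proximal style)."""

    def __init__(self, alpha=0.5, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = defaultdict(float)  # per-feature accumulated adjusted gradients
        self.n = defaultdict(float)  # per-feature accumulated squared gradients

    def _weight(self, feature):
        # Weights are derived lazily from z and n, so only seen features use memory.
        z = self.z.get(feature, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 penalty keeps this weight at exactly zero
        n = self.n.get(feature, 0.0)
        step = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - math.copysign(self.l1, z)) / step

    def predict(self, tokens):
        # Binary bag-of-words features: each distinct token contributes its weight once.
        score = sum(self._weight(f) for f in set(tokens))
        score = max(min(score, 35.0), -35.0)  # clip to keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, tokens, label):
        p = self.predict(tokens)
        g = p - label  # log-loss gradient for a feature with value 1
        for f in set(tokens):
            w = self._weight(f)
            sigma = (math.sqrt(self.n[f] + g * g) - math.sqrt(self.n[f])) / self.alpha
            self.z[f] += g - sigma * w
            self.n[f] += g * g


def chunks(iterable, size):
    """Yield lists of at most `size` items, so only one chunk is held in memory."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


# Toy labelled documents (made up); in practice these would be read lazily from disk.
DOCS = [
    ("good great movie", 1),
    ("bad awful movie", 0),
    ("great acting good plot", 1),
    ("awful plot bad acting", 0),
]

model = FTRLLogistic()
for _ in range(20):  # several passes over the stream
    for chunk in chunks(DOCS, 2):
        for text, label in chunk:
            model.update(text.split(), label)

print(model.predict("good great".split()))
print(model.predict("bad awful".split()))
```

Because the model state is just two sparse maps keyed by feature, memory grows with the number of distinct features seen rather than with the number of documents, which is why a single machine can often go further than expected before a MapReduce setup becomes necessary.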