scikit-learn, online-algorithm, large-data

Is it possible to apply online algorithms to big data files with sklearn?


I would like to apply fast online dimensionality reduction techniques, such as (online/mini-batch) Dictionary Learning, to big text corpora. My input data do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is it possible to do this with sklearn? Are there alternatives?

Thanks


Solution

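  • Yes. Since sklearn 0.13 there is an implementation of the HashingVectorizer. It hashes tokens to column indices instead of building a vocabulary, so it needs no fitting pass over the corpus and can vectorize text one batch at a time.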

    EDIT: Here is a full-fledged example of such an application

    Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (but rather on disk / network / ...).
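As a rough illustration of the idea (not the linked example itself), here is a minimal sketch of out-of-core text classification. It assumes a hypothetical input format of one `label<TAB>text` record per line, and the helper names `iter_minibatches` and `train_out_of_core` are made up for this sketch. It pairs `HashingVectorizer` with `SGDClassifier.partial_fit`, so only one mini-batch is held in memory at a time.

```python
from itertools import islice

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def iter_minibatches(lines, batch_size):
    """Yield (texts, labels) for successive chunks of an iterable of lines.

    Assumed (hypothetical) line format: "<integer label>\t<text>".
    """
    lines = iter(lines)
    while True:
        batch = list(islice(lines, batch_size))
        if not batch:
            return
        texts, labels = [], []
        for line in batch:
            label, text = line.rstrip("\n").split("\t", 1)
            labels.append(int(label))
            texts.append(text)
        yield texts, labels


def train_out_of_core(lines, classes, batch_size=1000, n_features=2**18):
    """Train a linear classifier incrementally, one mini-batch at a time."""
    # Stateless: transform() works without a prior fit() over the corpus.
    vectorizer = HashingVectorizer(n_features=n_features)
    clf = SGDClassifier(random_state=0)
    for texts, labels in iter_minibatches(lines, batch_size):
        X = vectorizer.transform(texts)  # sparse matrix for this batch only
        # partial_fit needs the full list of classes, since a single
        # batch may not contain every label.
        clf.partial_fit(X, labels, classes=classes)
    return vectorizer, clf
```

A file object is itself a lazy iterator over lines, so something like `with open("corpus.tsv") as f: train_out_of_core(f, classes=[0, 1])` (the file name is a placeholder) would stream the data from disk. Other sklearn estimators that expose `partial_fit` can be swapped in for the classifier in the same loop.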