I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) dictionary learning to big text corpora. My input data naturally do not fit in memory (which is why I want an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is this possible with sklearn? Are there alternatives?
Thanks!
Since scikit-learn 0.13 there is indeed an implementation of HashingVectorizer. Because it is stateless (it uses the hashing trick instead of building an in-memory vocabulary), it can transform documents one mini-batch at a time.
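To illustrate, here is a minimal sketch of streaming vectorization: since `transform()` needs no prior `fit`, you can vectorize a corpus chunk by chunk straight from disk. The file name `corpus.txt` and its one-document-per-line format are assumptions for the example:

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless vectorizer: no vocabulary is ever held in memory.
vectorizer = HashingVectorizer(n_features=2**18)

def iter_minibatches(path, batch_size=1000):
    """Yield successive lists of `batch_size` documents from a text file
    (ASSUMPTION: one document per line)."""
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
    if batch:
        yield batch

for docs in iter_minibatches("corpus.txt"):
    # Each mini-batch is vectorized independently; the whole corpus
    # never has to fit in memory.
    X = vectorizer.transform(docs)  # sparse matrix, shape (len(docs), 2**18)
```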
EDIT: Here is a full-fledged example of such an application.

Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (it lives on disk / network / ...).
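The pattern in that example boils down to pairing HashingVectorizer with an estimator that exposes `partial_fit`. Below is a hedged sketch along those lines, using SGDClassifier as the incremental learner; the file `labeled_corpus.tsv` and its `label<TAB>text` line format are hypothetical:

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()  # linear model trained by stochastic gradient descent

# partial_fit needs the full set of class labels up front
# (ASSUMPTION: a binary problem with labels 0 and 1).
all_classes = np.array([0, 1])

def iter_labeled_minibatches(path, batch_size=1000):
    """Yield (documents, labels) mini-batches from a tab-separated file
    with one `label<TAB>text` pair per line (hypothetical format)."""
    docs, labels = [], []
    with open(path) as f:
        for line in f:
            label, text = line.split("\t", 1)
            docs.append(text)
            labels.append(int(label))
            if len(docs) == batch_size:
                yield docs, labels
                docs, labels = [], []
    if docs:
        yield docs, labels

for docs, labels in iter_labeled_minibatches("labeled_corpus.tsv"):
    X = vectorizer.transform(docs)                    # stateless, per batch
    clf.partial_fit(X, labels, classes=all_classes)   # incremental update
```

Since MiniBatchDictionaryLearning also exposes a `partial_fit` method, the same loop structure should carry over to the dictionary-learning use case from the question, though you may need to densify the mini-batches and use a much smaller `n_features` for that estimator.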