scikit-learn online-algorithm large-data

Possibility to apply online algorithms on big data files with sklearn?

I would like to apply fast online dimensionality reduction techniques, such as (online/mini-batch) Dictionary Learning, to big text corpora. My input data do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is it possible to do this with sklearn? Are there alternatives?

Thanks

Solution

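Yes. Since scikit-learn 0.13 there is an implementation of the HashingVectorizer. It is stateless (it needs no fitting pass over the corpus), so it can vectorize text one mini-batch at a time.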
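A minimal sketch of what that looks like when iterating over a file, assuming one document per line. The `iter_batches` helper, the temporary corpus file, and the chosen `n_features` are illustrative assumptions, not part of the original answer:

```python
import os
import tempfile
from itertools import islice

from sklearn.feature_extraction.text import HashingVectorizer


def iter_batches(path, batch_size):
    """Yield lists of lines from a text file without loading the whole file.

    Hypothetical helper: assumes one document per line.
    """
    with open(path, encoding="utf-8") as fh:
        while True:
            batch = [line.rstrip("\n") for line in islice(fh, batch_size)]
            if not batch:
                return
            yield batch


# Tiny stand-in corpus written to disk; a real corpus would already be a file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(
        "the cat sat on the mat\n"
        "dogs chase cats\n"
        "stocks fell sharply today\n"
        "markets rallied after the news\n"
        "the dog slept\n"
    )
    corpus_path = tmp.name

# The vectorizer hashes tokens into a fixed number of columns, so there is
# no vocabulary to learn and transform() can be called without fit().
vectorizer = HashingVectorizer(n_features=2**10)

batch_shapes = []
for docs in iter_batches(corpus_path, batch_size=2):
    X = vectorizer.transform(docs)  # sparse matrix for this batch only
    batch_shapes.append(X.shape)

os.remove(corpus_path)
print(batch_shapes)
```

Only one batch is held in memory at a time, and every batch has the same number of columns, which is what lets a downstream online estimator consume them one after another.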

EDIT: Here is a full-fledged example of such an application

Basically, this example demonstrates that you can learn (e.g. classify text) from data that does not fit in the computer's main memory and is instead read incrementally from disk, the network, etc.
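As a rough illustration of that out-of-core pattern (not the linked example itself), the sketch below feeds mini-batches to an estimator through `partial_fit`. The `stream_minibatches` generator and its toy data are hypothetical stand-ins for a reader that pulls documents from disk or the network:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier


def stream_minibatches():
    """Hypothetical source of (texts, labels) mini-batches.

    In practice this would read chunks from a file or a network stream.
    """
    yield ["cheap pills buy now", "meeting at noon tomorrow"], [1, 0]
    yield ["buy cheap watches now", "lunch with the team today"], [1, 0]
    yield ["win money now buy", "project review meeting notes"], [1, 0]


vectorizer = HashingVectorizer(n_features=2**12)
clf = SGDClassifier(random_state=0)

# partial_fit needs the full set of classes up front, because any single
# mini-batch may not contain every label.
all_classes = [0, 1]

for texts, labels in stream_minibatches():
    X = vectorizer.transform(texts)  # stateless: no fit required
    clf.partial_fit(X, labels, classes=all_classes)

pred = clf.predict(vectorizer.transform(["buy cheap pills"]))
print(pred)
```

For the dimensionality reduction case in the question, `MiniBatchDictionaryLearning` and `IncrementalPCA` also expose `partial_fit`, so the same loop structure should carry over, subject to the input format each estimator accepts.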
