machine-learning scikit-learn large-data large-files large-data-volumes

Which supervised classifiers in scikit-learn are recommended for large datasets?


There are many supervised classifier algorithms available in scikit-learn, but I couldn't find any information about their scalability on large datasets. I know that, for instance, support vector machines don't behave well with huge datasets, but what about the others? Which supervised or semi-supervised classifier algorithms are most suitable for large datasets?


Solution

  • If you are specifically looking for classifiers in sklearn, have a look at "Scaling Strategies for large datasets" in the scikit-learn documentation.

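    Generally, the classifiers suited to large datasets learn incrementally: the data is split into mini-batches and the model is updated one batch at a time (in sklearn, through the partial_fit method). Here are some links for reference: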

    Incremental Learning links

    You can have a look at the sklearn classifiers that implement partial_fit, such as SGDClassifier, Perceptron, MultinomialNB, and BernoulliNB, for more info.

    If your data arrives as a stream, you can have a look at Apache Spark Streaming and then at MLlib in Apache Spark for more info.

    You can also have a look at FeatureHasher for large-scale feature hashing in sklearn.
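
To make the mini-batch idea concrete, here is a minimal sketch that combines FeatureHasher with a classifier trained through partial_fit. The stream_batches generator, the feature names, and the batch sizes are made up for illustration; in practice the batches would come from a file or a database read in chunks. SGDClassifier is just one possible choice of incremental classifier.

```python
import numpy as np
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import SGDClassifier


def stream_batches(n_batches=20, batch_size=50, seed=0):
    """Hypothetical data source: yields (list of feature dicts, labels).

    Stands in for reading a large file chunk by chunk.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        labels = rng.integers(0, 2, size=batch_size)
        records = [
            {
                # One-hot style categorical feature tied to the label.
                "color=red" if y == 1 else "color=blue": 1.0,
                # Noisy numeric feature.
                "size": float(rng.normal()),
            }
            for y in labels
        ]
        yield records, labels


# Hash feature names into a fixed-width sparse matrix, so no vocabulary
# has to be built or kept in memory.
hasher = FeatureHasher(n_features=2**10, input_type="dict")

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # partial_fit needs all classes on the first call

for records, labels in stream_batches():
    X = hasher.transform(records)  # stateless, so no fit step is needed
    clf.partial_fit(X, labels, classes=classes)

test_records = [{"color=red": 1.0}, {"color=blue": 1.0}]
predictions = clf.predict(hasher.transform(test_records))
print(predictions)
```

Only one batch is held in memory at a time, and the hashed feature width stays fixed however many distinct feature names show up in the stream.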