Tags: python, scikit-learn, sparse-matrix, pytables, naive-bayes

Practical limits to data size for sklearn.naive_bayes.BernoulliNB


I have a machine learning task at hand for which I would like to try Bernoulli Naive Bayes.
Since I need to produce some meaningful results very soon, I would like to use Python and, more specifically, sklearn. The data is "simple" but I have a lot of it, so I am trying to figure out the right approach that will allow me to write a "quick and dirty" BernoulliNB-based prototype that I can apply to as much data as possible.

The details are:

  1. Features are binary ( True / False )
  2. Classes are binary too ( think of this as a spam filter )
  3. A feature vector is up to 30,000 features long. I might be able to reduce this considerably via feature selection, but for now let's assume it is this long
  4. I have up to 200,000 data points that I can use to train ( .fit() ) my model

I haven't preprocessed the actual data yet, so I don't have the actual feature matrices and class vectors for training, but while I am doing the preprocessing, I want to figure out how big a chunk of data I can process. What I am essentially trying to do is rewrite the following block of code so that it can work with the specified values of nSamples and nFeatures:

from sklearn.naive_bayes import BernoulliNB
import numpy as np

nSamples = 200000
nFeatures =  30000

# Don't care about actual values yet, just data size
X = np.random.randint( 2, size = ( nSamples, nFeatures ) )
Y = np.random.randint( 2, size = ( nSamples, ) )

clf = BernoulliNB()
clf.fit( X, Y )

# predict_proba expects a 2-D array, so pass a slice rather than a single row
res = clf.predict_proba( X[2:3] )

a) What would be "best practices" approaches to this?
b) Do I need to incorporate PyTables?
c) Can sklearn work with PyTables objects?


Solution

  • You need to figure out how much of this data can fit in memory.

    If your matrix is sparse, you don't need to break it into chunks. It does not look like yours is, though.

    Processing data in chunks

    BernoulliNB and many other scikit-learn classifiers have a partial_fit method that does just that (see the out-of-core classification example in the scikit-learn documentation for a more complete walkthrough):

    clf = BernoulliNB()
    all_classes = [0, 1]
    for X_train, y_train in iter_batches:
        clf.partial_fit(X_train, y_train, classes=all_classes)
    

    Where iter_batches is an iterator that gives you chunks of data.
    Now you need to make sure these chunks fit in memory.
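
    As a minimal sketch, and assuming your preprocessed X and Y already live in numpy arrays (or memory-mapped arrays), iter_batches could simply yield fixed-size slices; the name make_batches and the batch_size value are placeholders, not something from your setup:

    import numpy as np

    def make_batches(X, Y, batch_size=10000):
        """Yield consecutive (X_chunk, Y_chunk) slices of at most batch_size rows."""
        n_samples = X.shape[0]
        for start in range(0, n_samples, batch_size):
            stop = start + batch_size
            yield X[start:stop], Y[start:stop]

    # e.g. iter_batches = make_batches(X, Y, batch_size=10000)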

    How big is it?

    You can figure out the size of a np.array using the nbytes attribute:

    from sklearn.naive_bayes import BernoulliNB
    import numpy as np
    
    nSamples = 2000
    nFeatures =  30000
    X = np.random.randint(2, size=(nSamples,nFeatures))
    X.nbytes / 10 ** 6
    Out[11]: 480.0
    

    So here the X array is about 480MB in memory.
    Note that if you have boolean variables and specify the types properly when loading the data you can get a much reduced footprint:

    X = np.random.randint(2, size=(nSamples,nFeatures)).astype(np.int8)
    X.nbytes / 10 ** 6
    Out[12]: 60.0
    

    A numpy boolean ( np.bool_ ) still takes 1 byte (8 bits) per element, though, so casting to bool does not buy you anything over int8.
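
    For instance, repeating the nbytes check above with a boolean cast gives the same footprint:

    X.astype(np.bool_).nbytes / 10 ** 6   # also 60.0 MB: one byte per boolean element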

    You can also compute these numbers by hand: with a 1-byte dtype the array will take about nSamples * nFeatures * 1 / 10 ** 6 MB (multiply by 8 for the default 64-bit integers).

    The rest depends on the RAM you have available. With a 1-byte dtype, the entire X array (200,000 × 30,000) is about 6GB, but you'll also need to account for the RAM that scikit-learn needs on top of that. "That should not be a lot" is all I can say with some confidence ;).
    Don't forget to pass binarize=None to the BernoulliNB constructor to avoid a copy of your X array though (your data is already binarized).
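
    Putting the pieces together, a sketch of the whole out-of-core training loop might look like this (make_batches is the hypothetical iterator sketched earlier, and the int8 cast is just one way to keep each chunk small):

    from sklearn.naive_bayes import BernoulliNB
    import numpy as np

    clf = BernoulliNB(binarize=None)  # data is already 0/1, so skip the binarize copy
    all_classes = [0, 1]

    for X_train, y_train in make_batches(X, Y, batch_size=10000):
        clf.partial_fit(X_train.astype(np.int8), y_train, classes=all_classes)

    res = clf.predict_proba(X[2:3])  # predict_proba wants a 2-D array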

    PyTables

    Do you need PyTables? No, but you can still use it if you'd like. sklearn works with numpy arrays, and PyTables serves its data back as numpy arrays, so you could use it to feed chunks of data to your partial_fit loop.
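
    For example, if you store the preprocessed data in an HDF5 file, a rough sketch of reading it back in chunks and feeding partial_fit could look like the following (the file name features.h5 and the node names X and Y are assumptions for illustration, not part of your setup):

    import numpy as np
    import tables
    from sklearn.naive_bayes import BernoulliNB

    clf = BernoulliNB(binarize=None)
    all_classes = [0, 1]
    batch_size = 10000

    with tables.open_file("features.h5", mode="r") as h5:
        X_node = h5.root.X   # e.g. a CArray/EArray of shape (nSamples, nFeatures)
        Y_node = h5.root.Y   # class labels, shape (nSamples,)
        n_samples = X_node.shape[0]
        for start in range(0, n_samples, batch_size):
            stop = start + batch_size
            # Slicing a PyTables array node returns a plain numpy array for that chunk
            X_chunk = X_node[start:stop].astype(np.int8)
            y_chunk = Y_node[start:stop]
            clf.partial_fit(X_chunk, y_chunk, classes=all_classes)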

    Hope this helps.