I have a machine learning task at hand for which I would like to try Bernoulli Naive Bayes.
Since I need to produce some meaningful results very soon, I would like to use Python and, more specifically, sklearn. The data is "simple" but I have a lot of it, so I am trying to figure out the right approach that will let me write a "quick and dirty" BernoulliNB-based prototype that I can apply to as much data as possible.
The details are:

- the feature values and class labels are binary (True / False)
- I plan to call BernoulliNB().fit() to train my model

I haven't preprocessed the actual data yet, so I don't have the actual feature matrices and class vectors for training, but while I am doing the preprocessing I want to figure out how big a data chunk I can process. What I am essentially trying to do is rewrite the following block of code so that it works with the specified values of nSamples and nFeatures:
from sklearn.naive_bayes import BernoulliNB
import numpy as np
nSamples = 200000
nFeatures = 30000
# Don't care about actual values yet, just data size
X = np.random.randint(2, size=(nSamples, nFeatures))
Y = np.random.randint(2, size=(nSamples,))
clf = BernoulliNB()
clf.fit(X, Y)
res = clf.predict_proba(X[2:3])  # predict_proba expects a 2D array, so pass a one-row slice
a) What would be "best practices" approaches to this?
b) Do I need to incorporate PyTables?
c) Can sklearn work with PyTables objects?
You need to figure out how much of this data can fit in memory.
If your matrix is sparse you don't need to break it into chunks; it does not look like yours is, though. In that case the way to go is to load the data in chunks and train incrementally: BernoulliNB and many other scikit-learn classifiers have a partial_fit method that does just that (see this more complete example):
clf = BernoulliNB()
all_classes = [0, 1]  # partial_fit must be told all possible class labels up front
for X_train, y_train in iter_batches:
    clf.partial_fit(X_train, y_train, classes=all_classes)
where iter_batches is an iterator that yields chunks of data.
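For illustration, here is a minimal sketch of what such an iterator could look like, using the same random placeholder data as in your question (make_batches and the batch size are made up for the example; in practice each chunk would come out of your preprocessing step):

import numpy as np

def make_batches(n_samples, n_features, batch_size=10000):
    # Yield (X_chunk, y_chunk) pairs of at most batch_size rows each.
    for start in range(0, n_samples, batch_size):
        stop = min(start + batch_size, n_samples)
        # Placeholder random data, as in the question; load the real chunk here instead.
        X_chunk = np.random.randint(2, size=(stop - start, n_features)).astype(np.int8)
        y_chunk = np.random.randint(2, size=(stop - start,))
        yield X_chunk, y_chunk

iter_batches = make_batches(200000, 30000, batch_size=10000)

Each chunk here is 10000 x 30000 int8 values, i.e. about 300 MB, which is where the sizing discussion below comes in.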
Now you need to make sure these chunks fit in memory.
You can figure out the size of a np.array using its nbytes attribute:
from sklearn.naive_bayes import BernoulliNB
import numpy as np
nSamples = 2000
nFeatures = 30000
X = np.random.randint(2, size=(nSamples,nFeatures))
X.nbytes / 10 ** 6
Out[11]: 480.0
So here the X array is about 480 MB in memory.
Note that if you have boolean variables and specify the types properly when loading the data you can get a much reduced footprint:
X = np.random.randint(2, size=(nSamples,nFeatures)).astype(np.int8)
X.nbytes / 10 ** 6
Out[12]: 60.0
A boolean (np.bool_) is still 1 byte (8 bits) though.
You can also compute these numbers by hand: with a 1-byte dtype the array will take about nSamples * nFeatures * 1 / 10 ** 6 MB.
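For the full dataset in the question, that back-of-the-envelope calculation gives:

nSamples = 200000
nFeatures = 30000
bytes_per_value = 1  # int8 (or bool), 1 byte per value
print(nSamples * nFeatures * bytes_per_value / 10 ** 6)  # 6000.0 MB, i.e. about 6 GB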
The rest depends on the RAM you have available. The entire X array, stored as int8, is 6 GB, but you'll also need to account for the RAM that scikit-learn itself will need. "That should not be a lot" is all I can say with some confidence ;).
Don't forget to pass binarize=None to the BernoulliNB constructor to avoid a copy of your X array though (your data is already binarized).
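Putting that together with the incremental loop from above (a sketch, using the same hypothetical iter_batches as before):

clf = BernoulliNB(binarize=None)  # features are already 0/1, so skip the binarization copy
for X_train, y_train in iter_batches:
    clf.partial_fit(X_train, y_train, classes=[0, 1])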
Do you need PyTables? No, but you can still use it if you'd like. sklearn works with numpy arrays and so does PyTables, so you could use it to feed chunks of data to your partial_fit loop.
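For example, a sketch assuming your preprocessed data already lives in an HDF5 file with array nodes /X and /y (the file name and node names are made up for illustration):

import tables
from sklearn.naive_bayes import BernoulliNB

clf = BernoulliNB(binarize=None)
batch_size = 10000

with tables.open_file("data.h5", mode="r") as h5:
    X = h5.root.X  # on-disk arrays, e.g. CArray/EArray nodes
    y = h5.root.y
    for start in range(0, X.shape[0], batch_size):
        stop = start + batch_size
        # Slicing a PyTables array reads only that chunk into memory as a numpy array.
        clf.partial_fit(X[start:stop], y[start:stop], classes=[0, 1])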
Hope this helps.