Tags: python, numpy, naive-bayes

Fast counting of feature values in Naive Bayes classifier


I am implementing a Naive Bayes classifier in Python (as part of a university assignment, so Python is a requirement). I got it to work and it produces about the same result as sklearn.naive_bayes.MultinomialNB. However, it's really slow compared to the sklearn implementation.

Suppose feature values are integers in the range of 0 to max_i, and class labels are also integers in the range of 0 to max_y. An example dataset looks like this:

>>> X = np.array([2,1,1,2,2,2,0,2]).reshape(4,2) # design matrix
>>> print(X)
[[2 1]
 [1 2]
 [2 2]
 [0 2]]
>>> y = np.array([0, 1, 2, 0]) # class labels
>>> print(y)
[0 1 2 0]

Now, as an intermediate step before computing the joint log-likelihood, I need the class-conditional likelihoods, i.e. P(x_ij = k | y = c), stored in a matrix ccl such that ccl[c][j][k] is the probability of value k in feature j given class c. For the example above, that matrix would be:

>>> print(ccl)
[[[0.5 0.  0.5]
  [0.  0.5 0.5]]

 [[0.  1.  0. ]
  [0.  0.  1. ]]

 [[0.  0.  1. ]
  [0.  0.  1. ]]]
>>> print(ccl[0][1][1]) # prob. of value 1 in feature 1 given class 0
0.5

The code I implemented to achieve this looks like this:

N, D = X.shape
K = np.max(X)+1
C = np.max(y)+1
ccl = np.zeros((C,D,K))
# ccl = ccl + alpha - 1 # disregard the Dirichlet prior for this question
# Count occurrences of feature values given class c
for i in range(N):
    for d in range(D):
        ccl[y[i]][d][X[i][d]] += 1
# Renormalize so each (class, feature) row becomes a probability distribution again
for c in range(C):
    for d in range(D):
        ccl[c][d] = np.divide(ccl[c][d], np.sum(ccl[c][d]))

Since Python loops are slow, this is slow too. I tried to mitigate the problem by one-hot encoding every feature value (so if the feature values are in the range [0, 1, 2], a 2 becomes [0, 0, 1], and so on) and summing the encodings up. However, I think too many np functions are called, so the computation still takes too long:

ccl = np.zeros((C,D,K))
for c in range(C):
    x = np.eye(K)[X[np.where(y==c)]]  # one-hot encoding
    ccl[c] += np.sum(x, axis=0)       # summing up
    ccl[c] /= ccl[c].sum(axis=1)[:, np.newaxis]  # renormalization

This results in the same output as above. Any hints on how to make this faster? I think np.eye (the one-hot encoding) is unnecessary and kills performance, but I can't think of a way to get rid of it. One last thing I considered is using np.unique() or collections.Counter for counting, but I haven't figured that out yet.
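
The furthest I got with the np.unique idea is a sketch like the following (untested; it still loops over every class/feature pair, so I suspect it won't help much):

ccl = np.zeros((C, D, K))
for c in range(C):
    for d in range(D):
        # Count how often each value occurs in feature d for class c
        vals, counts = np.unique(X[y == c, d], return_counts=True)
        ccl[c, d, vals] = counts
# Renormalize (assumes every class occurs at least once)
ccl /= ccl.sum(axis=2)[:, :, np.newaxis]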


Solution

  • So this is a pretty neat problem (and I had a similar one not that long ago). It looks like the fastest way to handle this is to construct a flat index array using just arithmetic operations, tally it up with np.bincount, and then reshape the result.

    N, D = X.shape
    K = np.max(X) + 1
    C = np.max(y) + 1
    # Flat index y[i]*D*K + d*K + X[i,d] for every (sample, feature) pair
    ccl = np.tile(y, D) * (D * K) + (X + K * np.arange(D)).T.flatten()
    # Tally the indices and reshape into (class, feature, value)
    ccl = np.bincount(ccl, minlength=C*D*K).reshape(C, D, K)
    # Normalize each (class, feature) row into a probability distribution
    ccl = np.divide(ccl, np.sum(ccl, axis=2)[:, :, np.newaxis])
    
    >>> ccl
    array([[[0.5, 0. , 0.5],
            [0. , 0.5, 0.5]],
    
           [[0. , 1. , 0. ],
            [0. , 0. , 1. ]],
    
           [[0. , 0. , 1. ],
            [0. , 0. , 1. ]]])
    

    As a comparison for speed, funca is your first, loop-based method, funcb is your second, one-hot-based method, and funcc is the method using bincount (definitions sketched below).
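
    For reproducibility, this is roughly how the three approaches look wrapped as functions returning ccl; funca and funcb are the question's two snippets verbatim, funcc is the bincount version above:

    def funca(X, y):
        # Loop-based counting from the question
        N, D = X.shape
        K = np.max(X) + 1
        C = np.max(y) + 1
        ccl = np.zeros((C, D, K))
        for i in range(N):
            for d in range(D):
                ccl[y[i]][d][X[i][d]] += 1
        for c in range(C):
            for d in range(D):
                ccl[c][d] = np.divide(ccl[c][d], np.sum(ccl[c][d]))
        return ccl

    def funcb(X, y):
        # One-hot-encoding approach from the question
        N, D = X.shape
        K = np.max(X) + 1
        C = np.max(y) + 1
        ccl = np.zeros((C, D, K))
        for c in range(C):
            x = np.eye(K)[X[np.where(y == c)]]
            ccl[c] += np.sum(x, axis=0)
            ccl[c] /= ccl[c].sum(axis=1)[:, np.newaxis]
        return ccl

    def funcc(X, y):
        # Index-arithmetic + bincount approach
        N, D = X.shape
        K = np.max(X) + 1
        C = np.max(y) + 1
        ccl = np.tile(y, D) * (D * K) + (X + K * np.arange(D)).T.flatten()
        ccl = np.bincount(ccl, minlength=C * D * K).reshape(C, D, K)
        return np.divide(ccl, np.sum(ccl, axis=2)[:, :, np.newaxis])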

    >>> X = np.random.randint(3, size=(10000,2))
    >>> y = np.random.randint(3, size=(10000))
    >>> timeit.timeit('funca(X,y)', number=100, setup="from __main__ import funca, X, y")
    2.632569645998956
    >>> timeit.timeit('funcb(X,y)', number=100, setup="from __main__ import funcb, X, y")
    0.10547748399949342
    >>> timeit.timeit('funcc(X,y)', number=100, setup="from __main__ import funcc, X, y")
    0.03524605900020106
    

    It might be possible to further refine this, but I don't have any more good ideas.
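
    One more vectorized possibility that I didn't benchmark: np.add.at scatter-adds directly into ccl and handles repeated indices correctly (a plain fancy-indexed += would not). In my experience ufunc.at tends to be slower than bincount for this kind of counting, but it is arguably easier to read:

    ccl = np.zeros((C, D, K))
    # Add 1 at position (y[i], d, X[i, d]) for every sample i and feature d
    np.add.at(ccl, (y[:, None], np.arange(D), X), 1)
    ccl /= ccl.sum(axis=2)[:, :, np.newaxis]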