I am trying to implement a Gaussian Naive Bayes classifier on the number classification data, where each feature represents a pixel.
While implementing this, I hit a bump: I noticed that some of the feature variances are equal to 0. This is a problem because I cannot divide by 0 when computing the probability.
What can I do to work around this?
The very short answer is that you cannot. You can usually fit a Gaussian distribution to any data (no matter its true distribution), but there is one exception: the constant case (0 variance). So what can you do? There are three main solutions:
Ignore 0-variance pixels. I do not generally recommend this approach, as it loses information, but if a pixel has 0 variance in every class and takes the same value in all of them (a common case for MNIST, where some pixels are black regardless of class), then it is fully justified mathematically. Why? The answer is really simple: if a feature is equal to the same single value for every class, it carries literally no information for classification, so ignoring it will not affect a model that assumes conditional independence of features (such as NB).
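As a rough illustration of dropping such pixels, here is a minimal NumPy sketch. The function name `drop_constant_features` and the toy array are made up for this example, and it only removes features that are constant over the whole training set (the case where ignoring them is harmless); it compares min and max rather than testing the variance against 0, to avoid floating-point surprises.

```python
import numpy as np

def drop_constant_features(X):
    """Drop columns of X that hold one single value across all samples.

    Hypothetical helper for illustration. Returns the reduced array and the
    boolean mask of kept columns, so the same mask can be applied to test data.
    """
    keep = X.max(axis=0) != X.min(axis=0)
    return X[:, keep], keep

# Toy data: the first "pixel" is always 0, as border pixels often are.
X = np.array([[0.0, 0.2, 0.0],
              [0.0, 0.9, 1.0],
              [0.0, 0.5, 0.0]])
X_reduced, keep = drop_constant_features(X)
print(keep)             # mask of columns that were kept
print(X_reduced.shape)  # shape after dropping constant columns
```

The mask has to be computed on the training set only and then reused unchanged at prediction time.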
Instead of the MLE estimate (that is, N(mean(X), std(X))), use a regularised estimator, for example of the form N(mean(X), std(X) + eps). This is similar in effect to adding a small amount of noise independently to each pixel (strictly, independent noise of variance eps would give a variance of var(X) + eps). This is a very generic approach that I would recommend.
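A minimal sketch of the regularised estimator, assuming NumPy arrays `X` (samples by features) and `y` (class labels). The function names, the default `eps=1e-3` and the toy data are my own choices for illustration, not something prescribed above; the value of `eps` would normally be tuned.

```python
import numpy as np

def fit_gaussian_nb(X, y, eps=1e-3):
    """Per-class mean and (std + eps) for every feature, plus log priors."""
    classes = np.unique(y)
    means = np.stack([X[y == c].mean(axis=0) for c in classes])
    # Adding eps keeps every standard deviation strictly positive.
    stds = np.stack([X[y == c].std(axis=0) for c in classes]) + eps
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, means, stds, log_priors

def predict_gaussian_nb(model, X):
    """Pick the class with the highest log posterior (up to a constant)."""
    classes, means, stds, log_priors = model
    # Broadcast (n_samples, 1, n_features) against (n_classes, n_features).
    z = (X[:, None, :] - means) / stds
    log_lik = -0.5 * z ** 2 - np.log(stds) - 0.5 * np.log(2 * np.pi)
    scores = log_lik.sum(axis=2) + log_priors
    return classes[scores.argmax(axis=1)]

# Toy data: feature 0 is constant, so its unregularised std would be 0.
X_train = np.array([[0.0, 0.1], [0.0, 0.2], [0.0, 0.9], [0.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
model = fit_gaussian_nb(X_train, y_train)
pred = predict_gaussian_nb(model, X_train)
```

Working with log densities also avoids underflow when many pixel likelihoods are multiplied together. If you use scikit-learn, its `GaussianNB` has a related `var_smoothing` parameter, which adds a fraction of the largest feature variance to all variances.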
Use a better distribution class. If your data are images (and since you have 0 variance I assume these are binary images, maybe even MNIST), you have K features, each in the [0, 1] interval. You can use a multinomial distribution with bucketing, so P(x ∈ Bi|y) = #{ x ∈ Bi | y } / #{ x | y }, that is, the fraction of class-y samples whose value falls into bucket Bi. This is usually the best thing to do (although it requires some knowledge of your data), because the underlying problem is that you are trying to use a model which is not suited to the data provided, and I can assure you that a proper distribution will always give better results with NB. So how can you find a good distribution? Plot the conditional marginals P(xi|y) for each feature and see what they look like; based on that, choose a distribution class which matches the behaviour. I can assure you these will not look like Gaussians.
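The bucketed estimate can be sketched as follows, assuming pixel values already scaled to [0, 1]. The function names, `n_bins=4` and the toy data are hypothetical. The pseudo-count `alpha` is an addition of mine that is not part of the formula above: with `alpha=0` you get exactly the raw count ratio, but then an empty bucket yields log(0).

```python
import numpy as np

def to_buckets(X, n_bins):
    """Map values in [0, 1] to bucket indices 0 .. n_bins - 1."""
    return np.minimum((X * n_bins).astype(int), n_bins - 1)

def fit_bucketed_nb(X, y, n_bins=4, alpha=1.0):
    """Estimate log P(x in B_i | y) per feature from bucket counts."""
    classes = np.unique(y)
    bins = to_buckets(X, n_bins)
    log_probs = np.empty((len(classes), X.shape[1], n_bins))
    for k, c in enumerate(classes):
        b = bins[y == c]
        # counts[j, i] = number of class-c samples with feature j in bucket i
        counts = np.stack([(b == i).sum(axis=0) for i in range(n_bins)], axis=1)
        # alpha is an assumed smoothing pseudo-count (not in the formula above)
        log_probs[k] = np.log((counts + alpha) / (len(b) + alpha * n_bins))
    log_priors = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_probs, log_priors, n_bins

def predict_bucketed_nb(model, X):
    classes, log_probs, log_priors, n_bins = model
    bins = to_buckets(X, n_bins)
    feature_idx = np.arange(X.shape[1])
    # Result has shape (n_classes, n_samples, n_features).
    per_feature = log_probs[:, feature_idx, bins]
    scores = per_feature.sum(axis=2).T + log_priors
    return classes[scores.argmax(axis=1)]

# Toy data: feature 0 is constant, which is no problem for this model.
X_train = np.array([[0.0, 0.1], [0.0, 0.2], [0.0, 0.9], [0.0, 0.8]])
y_train = np.array([0, 0, 1, 1])
model = fit_bucketed_nb(X_train, y_train)
pred = predict_bucketed_nb(model, X_train)
```

A constant feature simply puts all of its mass in one bucket here, so no division by zero can occur. For truly binary images, `n_bins=2` reduces this to a Bernoulli model per pixel.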