Tags: python, python-3.x, machine-learning, naivebayes

How to make and use a Naive Bayes classifier with scikit-learn


I'm following a book about machine learning in Python and I don't understand this code:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.naive_bayes import GaussianNB 
from sklearn import model_selection  # replaces cross_validation, which was removed in scikit-learn 0.20

from utilities import visualize_classifier

# Input file containing data
input_file = 'data_multivar_nb.txt'

# Load data from input file
data = np.loadtxt(input_file, delimiter=',')
X, y = data[:, :-1], data[:, -1] 

# Create Naive Bayes classifier 
classifier = GaussianNB()

# Train the classifier
classifier.fit(X, y)

# Predict the values for training data
y_pred = classifier.predict(X)

# Compute accuracy
accuracy = 100.0 * (y == y_pred).sum() / X.shape[0]
print("Accuracy of Naive Bayes classifier =", round(accuracy, 2), "%")

I just have a few questions:

What do data[:, :-1] and data[:, -1] do? The input file has this form:

2.18,0.57,0
4.13,5.12,1
9.87,1.95,2
4.02,-0.8,3
1.18,1.03,0
4.59,5.74,1

How does the accuracy computation work? What is X.shape[0]? Lastly, how do I use the classifier to predict y for new values?


Solution

  • You index a numpy array with square brackets, just as you would a list.

    my_list[-1] returns the last item in the list.

    For example:

    my_list = [1, 2, 3, 4]
    my_list[-1]
    4
    

    If you're familiar with list indexing, you will know what a slice is.

    my_list[:-1] returns every item from the beginning up to, but not including, the last one.

    my_list[:-1]
    [1, 2, 3]
    

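    In your code, data[:, :-1] applies the same slicing in two dimensions: the part before the comma selects rows and the part after it selects columns. So data[:, :-1] takes every row and all columns except the last (the features), and data[:, -1] takes every row of the last column only (the labels). Look up the numpy documentation on array indexing for more information. Understanding ndarrays is a prerequisite for using sklearn.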
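As a minimal sketch of the slicing above, here are the six sample rows from the question typed in by hand instead of loaded from the file. It also shows what X.shape[0] is and how the accuracy line counts matches. The y_pred values are hypothetical, made up only to illustrate the arithmetic; they are not the output of a trained classifier.

```python
import numpy as np

# The six sample rows from the question: two feature columns, then a class label.
data = np.array([
    [2.18, 0.57, 0],
    [4.13, 5.12, 1],
    [9.87, 1.95, 2],
    [4.02, -0.8, 3],
    [1.18, 1.03, 0],
    [4.59, 5.74, 1],
])

# The first index selects rows, the second selects columns.
X = data[:, :-1]   # all rows, every column except the last -> features
y = data[:, -1]    # all rows, last column only -> labels

print(X.shape)     # (6, 2)
print(y.shape)     # (6,)

# X.shape is (rows, columns), so X.shape[0] is the number of samples.
n_samples = X.shape[0]

# Hypothetical predictions (an assumption for illustration, not real model output).
y_pred = np.array([0, 1, 2, 3, 0, 2])

# (y == y_pred) is a boolean array; .sum() counts the True entries.
n_correct = (y == y_pred).sum()
accuracy = 100.0 * n_correct / n_samples
print(round(accuracy, 2))  # 83.33
```

To predict labels for new values, the fitted classifier from the question should accept a 2-D array with one row per sample and the same two feature columns, for example classifier.predict(np.array([[4.0, 5.0]])). It returns one predicted label per row.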