python machine-learning scikit-learn classification naivebayes

Spam Filter - Python newbie


So I have the task of creating a classification algorithm in Python for an email dataset: https://archive.ics.uci.edu/ml/datasets/spambase

I need to process the dataset, apply my classification algorithm (I have chosen three naive Bayes variants), print the accuracy score to the terminal, perform 5- or 10-fold cross-validation, and find out how many emails are spam.

As you can see, I have done some of the tasks, but I am still missing the cross-validation and the count of spam emails.

import numpy as np
import pandas as pd

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

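# Read data (pass header=None to read_csv if the file has no header row,
# as is the case for the raw spambase.data file)
dataset = pd.read_csv('dataset.csv').values

# What does shuffle do? How does it help?
np.random.shuffle(dataset)

# Features: the first 48 columns (the word-frequency attributes)
X = dataset[:, :48]
# Labels: the last column (1 = spam, 0 = not spam)
Y = dataset[:, -1]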

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = .33, random_state = 17)

# Bernoulli Naive Bayes
# binarize is a numeric threshold: feature values above it become 1, all others 0
BernNB = BernoulliNB(binarize=1.0)
BernNB.fit(X_train, Y_train)
y_expect = Y_test
y_pred = BernNB.predict(X_test)
print("Bernoulli Accuracy Score: ")
print(accuracy_score(y_expect, y_pred))

# Multinomial Naive Bayes (requires non-negative feature values)
MultiNB = MultinomialNB()
MultiNB.fit(X_train, Y_train)
y_pred = MultiNB.predict(X_test)
print("Multinomial Accuracy Score: ")
print(accuracy_score(y_expect, y_pred))

# Gaussian Naive Bayes
GausNB = GaussianNB()
GausNB.fit(X_train, Y_train)
y_pred = GausNB.predict(X_test)
print("Gaussian Accuracy Score: ")
print(accuracy_score(y_expect, y_pred))

# Bernoulli Naive Bayes with a lower binarization threshold
BernNB = BernoulliNB(binarize=0.1)
BernNB.fit(X_train, Y_train)
y_pred = BernNB.predict(X_test)
print("Bernoulli 'Altered' Accuracy Score: ")
print(accuracy_score(y_expect, y_pred))

I have looked into cross-validation and think I can apply it now, but what I don't understand is how to find out how many emails are spam. I have the accuracy of the different naive Bayes versions, but how would I actually find the number of spam emails? The last column is either 1 or 0, which defines whether an email is spam or not, so I don't know how to go about it.
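For the cross-validation step mentioned above, a minimal sketch using scikit-learn's cross_val_score might look like the following. The random stand-in data is an assumption made only so that the snippet runs on its own; with the real dataset, the X and Y arrays built in the code above would be passed instead.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Stand-in data so the sketch runs on its own (an assumption, not the
# spambase data); replace X and Y with the arrays from the real dataset.
rng = np.random.default_rng(17)
X = rng.random((200, 48))
Y = (X[:, 0] > 0.5).astype(int)

# cv=5 runs 5-fold cross-validation; use cv=10 for 10 folds.
scores = cross_val_score(GaussianNB(), X, Y, cv=5, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```

The same call should work with BernoulliNB or MultinomialNB in place of GaussianNB, which makes it easy to compare the three variants on identical folds.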


Solution

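  • The value returned by accuracy_score is the fraction of all test emails, spam and non-spam alike, that were classified correctly; it is not a count of spam emails. For example, 90% test accuracy means that 90 out of 100 test emails received the correct label. Since your class label 1 means spam, the number of emails predicted as spam is the number of 1s in y_pred, and the number of actual spam emails in the test set is the number of 1s in y_expect.

    Use sklearn.metrics.confusion_matrix(y_expect, y_pred) for a class-level breakdown of correct and incorrect predictions.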

    sklearn Doc

    For example:

    If y_expect = [1,1,0,0,1], you have 3 spam emails and 2 non-spam emails in your test data. If y_pred = [1,1,1,0,1], your model has detected all 3 spam emails correctly but has also flagged 1 non-spam email as spam.
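The small example above can be checked directly. This is a sketch that uses the five hand-written labels from the example rather than real model output; with a trained classifier, y_expect and y_pred would come from the test split.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Labels from the example above: 1 = spam, 0 = not spam
y_expect = [1, 1, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1]

# For binary labels ordered [0, 1], ravel() yields the counts
# in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_expect, y_pred, labels=[0, 1]).ravel()

print("Actual spam in test set:", tp + fn)
print("Emails predicted as spam:", tp + fp)
print("Spam correctly identified:", tp)
print("Non-spam flagged as spam:", fp)
print("Accuracy:", accuracy_score(y_expect, y_pred))
```

Here the accuracy is 4 out of 5 (0.8) because one non-spam email is misclassified, even though every spam email is caught.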