Tags: python, artificial-intelligence, neural-network, spam-prevention

Strange FANN behavior in spam classification task


I tried to write a simple spam classifier with the help of the FANN library. To do this I collected a number of spam and ham e-mails and built a dictionary of the most frequently used English words. I created a neural network with one hidden layer using the following code:

# fann2 bindings assumed here; older setups use `from pyfann import libfann` instead
from fann2 import libfann

num_input = get_input_size(dictionary_size)
num_neurons_hidden = 80  # I varied this between 20 and 640
num_output = 1

ann = libfann.neural_net()
ann.create_standard_array((num_input, num_neurons_hidden, num_output))
ann.set_activation_function_hidden(libfann.SIGMOID_SYMMETRIC)
ann.set_activation_function_output(libfann.SIGMOID_SYMMETRIC)
ann.set_training_algorithm(libfann.TRAIN_INCREMENTAL)

The output is 1 when the e-mail is ham and -1 when it is spam. Each input neuron represents whether a specific word occurred in the e-mail (1 if the word was in the mail, 0 if it was not).
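
For reference, `get_input` essentially builds that binary vector. A simplified sketch of what it might look like, assuming the second argument is a file path (the real helper also handles parsing, and the label argument is only used elsewhere):

def get_input(label, email_path, dictionary):
    # One input neuron per dictionary word: 1 if the word occurs
    # in the e-mail, 0 otherwise. The label argument is not needed
    # for the encoding itself.
    with open(email_path, errors="ignore") as f:
        words = set(f.read().lower().split())
    return [1 if word in words else 0 for word in dictionary]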

To train the neural network I use the following code (for each e-mail in the training set):

# Build the input vector from the training e-mail
input = get_input(train_res, train_file, dictionary)
ann.train(input, (train_res,))
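
Expanded, the training pass looks roughly like this (`training_set` as a list of (label, path) pairs is my own naming; since TRAIN_INCREMENTAL updates the weights after every single pattern, such a pass is normally repeated for several epochs):

for epoch in range(num_epochs):  # num_epochs is illustrative
    for train_res, train_file in training_set:  # train_res: 1 = ham, -1 = spam
        input = get_input(train_res, train_file, dictionary)
        ann.train(input, (train_res,))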

To check whether an e-mail from the test set is spam, I use the following code (for each e-mail in the test set):

input = get_input(SPAM, test_spam, dictionary)
res = ann.run(input)[0]
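
The counts reported below come from thresholding this output and tallying per class; for the spam half of the test set, the loop is along these lines (the list and counter names are illustrative, and the hams are tallied the same way with the opposite sign test):

correct_spam = incorrect_spam = 0
for test_spam in spam_test_files:
    input = get_input(SPAM, test_spam, dictionary)
    res = ann.run(input)[0]
    # SIGMOID_SYMMETRIC outputs lie in [-1, 1]; I treat negative as spam
    if res < 0:
        correct_spam += 1
    else:
        incorrect_spam += 1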

But no matter what dictionary size I use (I tried from 1,000 to 40,000 words) or how many neurons the hidden layer has (20 to 640), after the network is trained it classifies almost all e-mails as spam, or almost all as ham. For example, I get results like this:

Dictionary size: 10000
Hidden layer size: 80
Correctly classified hams: 596
Incorrectly classified hams: 3845
Correctly classified spams: 436
Incorrectly classified spams: 62

where almost all spams are classified correctly but almost all hams are misclassified, or results like this:

Dictionary size: 20000
Hidden layer size: 20
Correctly classified hams: 4124
Incorrectly classified hams: 397
Correctly classified spams: 116
Incorrectly classified spams: 385

which are the opposite. I tried using more training data: I started with approximately 1,000 e-mails in the training set (the proportion of spam to ham is almost 50:50), and I am now testing it with approximately 4,000 e-mails (spam:ham roughly 50:50 again), but the result is the same.

What could the problem be? Thank you in advance.


Solution

  • Have you verified that there is a significant difference between spam and ham mails in terms of how often they contain the words on your word list? My guess would be that there might not be a very clear difference between spam and ham when it comes to the occurrence of regular words.

    If you are using 'real' spam mails, note that many spammers use a technique known as Bayesian poisoning, where they include lots of 'legitimate' text in order to confuse spam filters. Since you filter simply on the occurrence of common words, and not on words statistically indicative of spam or ham, your approach will be very sensitive to Bayesian poisoning; a rough way to check this is sketched below.
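
    As a quick check, you could compare, for every word, the fraction of spam e-mails versus ham e-mails it appears in, and keep only the words with the largest gap. A rough sketch (the corpus lists and function names are illustrative, not from the question):

    from collections import Counter

    def word_document_rates(texts):
        # Fraction of e-mails in which each word occurs at least once
        counts = Counter()
        for text in texts:
            counts.update(set(text.lower().split()))
        total = len(texts)
        return {word: n / total for word, n in counts.items()}

    def discriminative_words(spam_texts, ham_texts, top_n=1000):
        # Keep the words whose spam and ham document rates differ the most
        spam_rates = word_document_rates(spam_texts)
        ham_rates = word_document_rates(ham_texts)
        all_words = set(spam_rates) | set(ham_rates)
        gap = lambda w: abs(spam_rates.get(w, 0.0) - ham_rates.get(w, 0.0))
        return sorted(all_words, key=gap, reverse=True)[:top_n]

    A dictionary built this way should be much less sensitive to poisoning text than a list of the most common English words.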