python machine-learning scikit-learn statistics cross-validation

Why does data cleaning decrease accuracy?

I am using the 20 newsgroups dataset from scikit-learn for reproducibility. When I train an SVM model and then clean the data by removing headers, footers and quotes, the accuracy decreases. Isn't data cleaning supposed to improve it? What is the point of doing all that only to get worse accuracy?

I have created this example with data cleaning to show what I am referring to:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

categories = ['alt.atheism', 'comp.graphics']
# remove=(...) is the data cleaning step
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=2017,
                                      remove=('headers', 'footers', 'quotes'))
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=2017,
                                     remove=('headers', 'footers', 'quotes'))
y_train = newsgroups_train.target
y_test = newsgroups_test.target

vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=True, max_df=0.5, ngram_range=(1, 2), stop_words='english')
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)


from sklearn.svm import SVC
from sklearn import metrics

clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)
clf = clf.fit(X_train, y_train)
y_train_pred = clf.predict(X_train)
y_test_pred = clf.predict(X_test)
print('Train accuracy_score: ', metrics.accuracy_score(y_train, y_train_pred))
print('Test accuracy_score: ',metrics.accuracy_score(newsgroups_test.target, y_test_pred))
print("-"*12)
print("Train Metrics: ", metrics.classification_report(y_train, y_train_pred))
print("-"*12)
print("Test Metrics: ", metrics.classification_report(newsgroups_test.target, y_test_pred))

Results before data cleaning:

Train accuracy_score:  1.0
Test accuracy_score:  0.9731638418079096

Results after data cleaning:

Train accuracy_score:  0.9887218045112782
Test accuracy_score:  0.9209039548022598

Solution

It is not necessarily your data cleaning. I assume you ran the script twice?

The problem is this line of code:

clf = SVC(C=10, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1, kernel='rbf', max_iter=-1,
    probability=False, random_state=None, shrinking=True, tol=0.001,
    verbose=False)

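Here random_state=None. In general you should fix the random state, e.g. random_state=42; otherwise results may not be reproducible from one run to the next. Note, however, that for SVC the random_state only controls the shuffling used for probability estimates and is ignored when probability=False, so on its own it is unlikely to explain the drop you see.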
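
A quick way to check whether the seed matters is to fit the same model twice with different random_state values and compare the predictions. The sketch below does that on synthetic data from make_classification (an assumption made to avoid the dataset download; the SVC hyperparameters match the question, and the helper fit_predict is just for this example):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the TF-IDF features (assumption: any fixed dataset will do).
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

def fit_predict(seed):
    # Same hyperparameters as in the question; only the seed differs.
    clf = SVC(C=10, gamma=1, kernel='rbf', probability=False, random_state=seed)
    return clf.fit(X, y).predict(X)

pred_a = fit_predict(1)
pred_b = fit_predict(2)
same = bool(np.array_equal(pred_a, pred_b))
print("Identical predictions across seeds:", same)
```

If the predictions are identical, the seed is not the cause of the accuracy gap.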

Edit:

The explanation is in the scikit-learn documentation page for the dataset itself. If you implement:

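import numpy as np

def show_top10(classifier, vectorizer, categories):
    # Needs a linear model that exposes coef_ with one row per category.
    # get_feature_names_out() (scikit-learn 1.0+) replaces the older get_feature_names().
    feature_names = np.asarray(vectorizer.get_feature_names_out())
    for i, category in enumerate(categories):
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s: %s" % (category, " ".join(feature_names[top10])))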

You can now see many things that these features have overfit to:

Almost every group is distinguished by whether headers such as NNTP-Posting-Host: and Distribution: appear more or less often.

Another significant feature involves whether the sender is affiliated with a university, as indicated either by their headers or their signature.

The word “article” is a significant feature, based on how often people quote previous posts like this: “In article [article ID], [name] <[e-mail address]> wrote:”

Other features match the names and e-mail addresses of particular people who were posting at the time.

With such an abundance of clues that distinguish newsgroups, the classifiers barely have to identify topics from text at all, and they all perform at the same high level.

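For this reason, the functions that load 20 Newsgroups data provide a parameter called remove, telling it what kinds of information to strip out of each file. remove should be a tuple containing any subset of ('headers', 'footers', 'quotes'), telling it to remove headers, signature blocks, and quotation blocks respectively.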
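
As a rough illustration of what stripping headers, footers and quotes from a post means, here is a simplified sketch in plain Python. It is not scikit-learn's actual implementation: the function name strip_post and its heuristics (headers end at the first blank line, quoted lines start with ">" or "|" or end with "wrote:" / "writes:", the signature starts at a line consisting of "--") are assumptions made for this example.

```python
import re

# Lines treated as quotes: start with ">" or "|", or attribution
# lines ending in "writes:" / "wrote:" (assumed heuristic).
QUOTE_RE = re.compile(r"^\s*[>|]|(writes|wrote):\s*$")

def strip_post(text):
    """Remove the header block, quoted lines, and a trailing signature."""
    # Headers: everything before the first blank line.
    _, sep, body = text.partition("\n\n")
    if not sep:
        body = text  # no blank line found, treat the whole text as body
    lines = body.split("\n")
    # Footer: cut at the first signature delimiter line ("--"), if any.
    for i, line in enumerate(lines):
        if line.rstrip() == "--":
            lines = lines[:i]
            break
    # Quotes: drop quoted lines and attribution lines.
    kept = [ln for ln in lines if not QUOTE_RE.search(ln)]
    return "\n".join(kept).strip()

post = (
    "From: someone@example.edu\n"
    "NNTP-Posting-Host: host.example.edu\n"
    "\n"
    "In article <1234@example.edu>, someone else wrote:\n"
    "> Is the renderer fast?\n"
    "\n"
    "It is fast enough for my scenes.\n"
    "--\n"
    "Someone, Example University\n"
)
cleaned = strip_post(post)
print(cleaned)
```

Only the sentence the author actually wrote survives, which is the text a classifier would have to rely on for a genuinely new document.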

To summarize:

The remove parameter protects you from data leakage: without it, your training data contains information that you will not have at prediction time. You have to remove it; otherwise you get a better score on this dataset, but that performance will not carry over to new data.
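
The effect can be reproduced on a deliberately extreme toy example. In the sketch below everything is made up for illustration (the posts, the Organization header values, and the helper make_posts); each body appears once per class, so the label can only be learned from the header. For brevity the model is scored on the posts it was trained on, once with and once without headers.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Made-up posts: each body appears once per class, so the text itself
# says nothing about the label; only the fake header line does.
bodies = ["does anyone have a pointer", "thanks that was helpful", "not sure I agree with that"]
headers = {0: "Organization: alphauni", 1: "Organization: betauni"}

def make_posts(with_headers):
    posts, labels = [], []
    for label, header in headers.items():
        for body in bodies:
            posts.append(header + "\n\n" + body if with_headers else body)
            labels.append(label)
    return posts, labels

raw_posts, y = make_posts(with_headers=True)
clean_posts, _ = make_posts(with_headers=False)

model = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
model.fit(raw_posts, y)

acc_with_headers = model.score(raw_posts, y)        # header token is available
acc_without_headers = model.score(clean_posts, y)   # header token is gone
print("with headers:   ", acc_with_headers)
print("without headers:", acc_without_headers)
```

Once the header is gone, identical bodies receive identical predictions, so the model can do no better than chance on this data. The high score with headers came from the leaked information rather than from the text.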