Tags: python, scikit-learn, python-3.4, document-classification

Use scikit-learn to distinguish between similar categories


I would like to classify text from documents into different categories. Each document can go into exactly one of the following categories: PR, AR, KID, SAR.

I found an example using scikit-learn and am able to use it:

import numpy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from pandas import DataFrame

def build_data_frame(path, classification):
    # Read one document and return a single-row DataFrame indexed by its path
    with open(path, mode='r', encoding='utf8') as f:
        txt = f.read()

    return DataFrame([{'text': txt, 'class': classification}], index=[path])

# Categories
PR = 'PR'
AR = 'AR'
KID = 'KID'
SAR = 'SAR'

# Training documents
SOURCES = [
    (r'C:/temp_training/PR/PR1.txt', PR),
    (r'C:/temp_training/PR/PR2.txt', PR),
    (r'C:/temp_training/PR/PR3.txt', PR),
    (r'C:/temp_training/PR/PR4.txt', PR),
    (r'C:/temp_training/PR/PR5.txt', PR),
    (r'C:/temp_training/AR/AR1.txt', AR),
    (r'C:/temp_training/AR/AR2.txt', AR),
    (r'C:/temp_training/AR/AR3.txt', AR),
    (r'C:/temp_training/AR/AR4.txt', AR),
    (r'C:/temp_training/AR/AR5.txt', AR),
    (r'C:/temp_training/KID/KID1.txt', KID),
    (r'C:/temp_training/KID/KID2.txt', KID),
    (r'C:/temp_training/KID/KID3.txt', KID),
    (r'C:/temp_training/KID/KID4.txt', KID),
    (r'C:/temp_training/KID/KID5.txt', KID),
    (r'C:/temp_training/SAR/SAR1.txt', SAR),
    (r'C:/temp_training/SAR/SAR2.txt', SAR),
    (r'C:/temp_training/SAR/SAR3.txt', SAR),
    (r'C:/temp_training/SAR/SAR4.txt', SAR),
    (r'C:/temp_training/SAR/SAR5.txt', SAR)
]

# Real documents
TESTS = [
    (r'C:/temp_testing/PR/PR1.txt'),
    (r'C:/temp_testing/PR/PR2.txt'),
    (r'C:/temp_testing/PR/PR3.txt'),
    (r'C:/temp_testing/PR/PR4.txt'),
    (r'C:/temp_testing/PR/PR5.txt'),
    (r'C:/temp_testing/AR/AR1.txt'),
    (r'C:/temp_testing/AR/AR2.txt'),
    (r'C:/temp_testing/AR/AR3.txt'),
    (r'C:/temp_testing/AR/AR4.txt'),
    (r'C:/temp_testing/AR/AR5.txt'),
    (r'C:/temp_testing/KID/KID1.txt'),
    (r'C:/temp_testing/KID/KID2.txt'),
    (r'C:/temp_testing/KID/KID3.txt'),
    (r'C:/temp_testing/KID/KID4.txt'),
    (r'C:/temp_testing/KID/KID5.txt'),
    (r'C:/temp_testing/SAR/SAR1.txt'),
    (r'C:/temp_testing/SAR/SAR2.txt'),
    (r'C:/temp_testing/SAR/SAR3.txt'),
    (r'C:/temp_testing/SAR/SAR4.txt'),
    (r'C:/temp_testing/SAR/SAR5.txt')
]

data_train = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES:
    # DataFrame.append works in the pandas of this era; it was removed in
    # pandas 2.0, where pandas.concat is the replacement
    data_train = data_train.append(build_data_frame(path, classification))

data_train = data_train.reindex(numpy.random.permutation(data_train.index))

examples = []

for path in TESTS:
    with open(path, mode='r', encoding='utf8') as f:
        examples.append(f.read())

target_names = [PR, AR, KID, SAR]  # not used below, but handy for classification reports

classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), analyzer='word', strip_accents='unicode', stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(data_train['text'], data_train['class'])
predicted = classifier.predict(examples)

print(predicted)

Output:

['PR' 'PR' 'PR' 'PR' 'PR' 'AR' 'AR' 'AR' 'AR' 'AR' 'KID' 'KID' 'KID' 'KID'
 'KID' 'AR' 'AR' 'AR' 'SAR' 'AR']

PR, AR and KID are perfectly recognized.

However, only one of the five SAR documents (the last five predictions) is classified correctly; the rest are labelled AR. SAR and AR are quite similar, which probably explains why the classifier confuses them.

I tried to play with the n-gram range, but 1 (min) and 2 (max) seem to give the best results.

  • Any idea how to increase the precision to distinguish between AR and SAR categories?

  • Is there a way to display a confidence score for a particular document? e.g. PR (70%), meaning that the algorithm is 70% confident in the prediction

If you need the documents, here is the dataset: http://1drv.ms/21dnL6j


Solution

  • This is not strictly a programming question, so I suggest posting it on a more data-science-oriented Stack Exchange site.

    Anyhow, some things you can try:

    • Use some other classifier.
    • Tune the classifier hyperparameters using a grid search.
    • Use OneVsOne instead of OneVsAll as the strategy. This will likely help you differentiate SAR from AR.
    • To "display the percentage of recognition for a particular document", use the probability outputs that some models provide: call classifier.predict_proba instead of classifier.predict. Note that LinearSVC itself does not implement predict_proba; wrap it in CalibratedClassifierCV or report its decision_function scores instead.
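    As a sketch of that last suggestion (the toy texts below are stand-ins for the real documents, not the actual dataset): wrapping LinearSVC in CalibratedClassifierCV fits a probability model on top of the SVM's decision scores, which gives the pipeline a predict_proba method.

    ```python
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Toy stand-ins for the four document categories
    texts = ["annual report text", "press release text",
             "kid story text", "special annual report text"] * 5
    labels = ["AR", "PR", "KID", "SAR"] * 5

    # CalibratedClassifierCV wraps LinearSVC (which has no predict_proba)
    # and learns to map its decision scores to probabilities
    classifier = Pipeline([
        ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
        ('tfidf', TfidfTransformer()),
        ('clf', CalibratedClassifierCV(LinearSVC(), cv=3)),
    ])
    classifier.fit(texts, labels)

    proba = classifier.predict_proba(["press release text"])[0]
    for label, p in sorted(zip(classifier.classes_, proba), key=lambda t: -t[1]):
        print('{} ({:.0%})'.format(label, p))
    ```

    The probabilities sum to 1 across the four classes, so the top one can be reported as the confidence, in the spirit of the "PR (70%)" example from the question.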

    Good luck!
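    Combining the grid-search and OneVsOne suggestions, a minimal sketch (again with toy stand-in texts; the grid values are illustrative examples, not tuned settings):

    ```python
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.model_selection import GridSearchCV
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import LinearSVC

    # Toy stand-ins for the four document categories
    texts = ["annual report text", "press release text",
             "kid story text", "special annual report text"] * 5
    labels = ["AR", "PR", "KID", "SAR"] * 5

    # OneVsOneClassifier trains one binary SVM per pair of classes,
    # so AR vs SAR gets its own dedicated decision boundary
    pipeline = Pipeline([
        ('vectorizer', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('clf', OneVsOneClassifier(LinearSVC())),
    ])

    # Pipeline parameters are addressed as <step>__<param>; the extra
    # 'estimator' level reaches the LinearSVC inside OneVsOneClassifier
    param_grid = {
        'vectorizer__ngram_range': [(1, 1), (1, 2)],
        'clf__estimator__C': [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=3)
    search.fit(texts, labels)

    print(search.best_params_)
    print(search.predict(["special annual report text"]))
    ```

    GridSearchCV refits the best parameter combination on the full training set, so `search` can be used directly for prediction afterwards.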