I would like to classify text from documents into different categories. Each document can go into only one of the following categories: PR, AR, KID, SAR.
I found an example using scikit-learn and was able to get it working:
import numpy
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from pandas import DataFrame
def build_data_frame(path, classification):
    # Read one document and return it as a single-row DataFrame indexed by its path
    rows = []
    index = []
    with open(path, mode='r', encoding='utf8') as f:
        txt = f.read()
    rows.append({'text': txt, 'class': classification})
    index.append(path)
    data_frame = DataFrame(rows, index=index)
    return data_frame
# Categories
PR = 'PR'
AR = 'AR'
KID = 'KID'
SAR = 'SAR'
# Training documents
SOURCES = [
    (r'C:/temp_training/PR/PR1.txt', PR),
    (r'C:/temp_training/PR/PR2.txt', PR),
    (r'C:/temp_training/PR/PR3.txt', PR),
    (r'C:/temp_training/PR/PR4.txt', PR),
    (r'C:/temp_training/PR/PR5.txt', PR),
    (r'C:/temp_training/AR/AR1.txt', AR),
    (r'C:/temp_training/AR/AR2.txt', AR),
    (r'C:/temp_training/AR/AR3.txt', AR),
    (r'C:/temp_training/AR/AR4.txt', AR),
    (r'C:/temp_training/AR/AR5.txt', AR),
    (r'C:/temp_training/KID/KID1.txt', KID),
    (r'C:/temp_training/KID/KID2.txt', KID),
    (r'C:/temp_training/KID/KID3.txt', KID),
    (r'C:/temp_training/KID/KID4.txt', KID),
    (r'C:/temp_training/KID/KID5.txt', KID),
    (r'C:/temp_training/SAR/SAR1.txt', SAR),
    (r'C:/temp_training/SAR/SAR2.txt', SAR),
    (r'C:/temp_training/SAR/SAR3.txt', SAR),
    (r'C:/temp_training/SAR/SAR4.txt', SAR),
    (r'C:/temp_training/SAR/SAR5.txt', SAR)
]
# Real documents
TESTS = [
    (r'C:/temp_testing/PR/PR1.txt'),
    (r'C:/temp_testing/PR/PR2.txt'),
    (r'C:/temp_testing/PR/PR3.txt'),
    (r'C:/temp_testing/PR/PR4.txt'),
    (r'C:/temp_testing/PR/PR5.txt'),
    (r'C:/temp_testing/AR/AR1.txt'),
    (r'C:/temp_testing/AR/AR2.txt'),
    (r'C:/temp_testing/AR/AR3.txt'),
    (r'C:/temp_testing/AR/AR4.txt'),
    (r'C:/temp_testing/AR/AR5.txt'),
    (r'C:/temp_testing/KID/KID1.txt'),
    (r'C:/temp_testing/KID/KID2.txt'),
    (r'C:/temp_testing/KID/KID3.txt'),
    (r'C:/temp_testing/KID/KID4.txt'),
    (r'C:/temp_testing/KID/KID5.txt'),
    (r'C:/temp_testing/SAR/SAR1.txt'),
    (r'C:/temp_testing/SAR/SAR2.txt'),
    (r'C:/temp_testing/SAR/SAR3.txt'),
    (r'C:/temp_testing/SAR/SAR4.txt'),
    (r'C:/temp_testing/SAR/SAR5.txt')
]
data_train = DataFrame({'text': [], 'class': []})
for path, classification in SOURCES:
    data_train = data_train.append(build_data_frame(path, classification))
# Shuffle the training rows
data_train = data_train.reindex(numpy.random.permutation(data_train.index))
examples = []
for path in TESTS:
    with open(path, mode='r', encoding='utf8') as f:
        examples.append(f.read())
target_names = [PR, AR, KID, SAR]
classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), analyzer='word', strip_accents='unicode', stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])
classifier.fit(data_train['text'], data_train['class'])
predicted = classifier.predict(examples)
print(predicted)
Output:
['PR' 'PR' 'PR' 'PR' 'PR' 'AR' 'AR' 'AR' 'AR' 'AR' 'KID' 'KID' 'KID' 'KID'
'KID' 'AR' 'AR' 'AR' 'SAR' 'AR']
PR, AR and KID documents are recognized perfectly.
However, only one of the five SAR documents (the last five entries) is classified correctly; the other four are labelled AR. SAR and AR documents are quite similar, which probably explains why the classifier confuses them.
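For what it's worth, the per-class breakdown can be printed like this (a minimal sketch; the expected labels below are an assumption based on the temp_testing folder layout, five documents per category in the order listed in TESTS):

from sklearn.metrics import classification_report, confusion_matrix

# Assumed ground-truth labels, inferred from the temp_testing directory names
expected = [PR] * 5 + [AR] * 5 + [KID] * 5 + [SAR] * 5

print(confusion_matrix(expected, predicted, labels=target_names))
print(classification_report(expected, predicted, labels=target_names, target_names=target_names))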
I tried playing with the n-gram range, but (1, 2), i.e. unigrams plus bigrams, seems to give the best results.
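For reference, this kind of parameter sweep could also be automated with GridSearchCV, roughly as follows (a sketch only; the grid values and cv setting are just examples, and the step name must match the pipeline above):

from sklearn.model_selection import GridSearchCV

# Example grid over the vectorizer's n-gram range
param_grid = {'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)]}
search = GridSearchCV(classifier, param_grid, cv=3)
search.fit(data_train['text'], data_train['class'])
print(search.best_params_, search.best_score_)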
Any idea how to improve the precision in distinguishing between the AR and SAR categories?
Also, is there a way to display a confidence percentage for a particular document, e.g. PR (70%), meaning that the algorithm is 70% confident in its prediction?
If you need the documents, here is the dataset: http://1drv.ms/21dnL6j
This is not strictly a programming question, so I suggest you try posting it on a more data-science-oriented Stack Exchange site.
Anyhow, something you can try: use the classifier.predict_proba function instead of classifier.predict to get per-class confidence scores. Good luck!
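Note that LinearSVC itself does not provide predict_proba, so with your exact pipeline you would either inspect the classifier.decision_function margins or wrap the SVM in CalibratedClassifierCV to get probability estimates. A rough sketch of the latter, assuming your training data and TESTS list from above (cv=3 is just an example value):

from sklearn.calibration import CalibratedClassifierCV

# Same pipeline, but with a calibrated LinearSVC so predict_proba is available
proba_classifier = Pipeline([
    ('vectorizer', CountVectorizer(ngram_range=(1, 2), analyzer='word', strip_accents='unicode', stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(CalibratedClassifierCV(LinearSVC(), cv=3)))])
proba_classifier.fit(data_train['text'], data_train['class'])

# Print the estimated probability (as a percentage) of each class for every test document
for path, probs in zip(TESTS, proba_classifier.predict_proba(examples)):
    print(path, dict(zip(proba_classifier.classes_, (probs * 100).round(1))))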