
Python TextBlob and text classification


I'm trying to build a text classification model with Python and TextBlob. The script runs on my server, and the idea is that in the future users will be able to submit their text and have it classified. I'm loading the training set from a CSV file:

# -*- coding: utf-8 -*-
import sys
from textblob.classifiers import NaiveBayesClassifier

# Send everything printed below to a file instead of the console.
sys.stdout = open('yyyyyyyyy.txt', 'w')

# Train on the CSV file: one row per example, text first, label second.
with open('file.csv', 'r', encoding='latin-1') as fp:
    cl = NaiveBayesClassifier(fp, format="csv")

print(cl.classify("some text"))

The CSV is about 500 lines long (with strings between 10 and 100 characters), and NaiveBayesClassifier needs about 2 minutes to train before it can classify my text. I'm not sure whether it is normal for it to take this long; maybe my server is just slow, with only 512 MB of RAM.

Example of a CSV line:

"Oggi alla Camera con la Fondazione Italia-Usa abbiamo consegnato a 140 studenti laureati con 110 e 110 lode i diplomi del Master in Marketing Comunicazione e Made in Italy.",FI-PDL

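For illustration, here is a minimal sketch of how such a line splits into a text and a label. It uses only the standard csv module, not TextBlob's own loader, and it assumes the text-then-label column order shown in the example line:

```python
import csv
import io

# One line in the same layout as the training file: quoted text, then the label.
sample = (
    '"Oggi alla Camera con la Fondazione Italia-Usa abbiamo consegnato a 140 '
    'studenti laureati con 110 e 110 lode i diplomi del Master in Marketing '
    'Comunicazione e Made in Italy.",FI-PDL\n'
)

# csv.reader handles the quoting, so a comma inside the quoted text
# would not be treated as a column separator.
rows = list(csv.reader(io.StringIO(sample)))
text, label = rows[0]

print(label)
print(text[:30])
```
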
What is not clear to me, and I can't find an answer in the TextBlob documentation, is whether there is a way to save my trained classifier (and so save a lot of time), because right now the classifier is trained again every time I run the script. I'm new to text classification and machine learning, so my apologies if this is a dumb question.

Thanks in advance.


Solution

  • OK, I found that the pickle module is what I need :)

    Training:

    # -*- coding: utf-8 -*-
    import pickle
    from textblob.classifiers import NaiveBayesClassifier

    # Train the classifier once from the CSV file.
    with open('file.csv', 'r', encoding='latin-1') as fp:
        cl = NaiveBayesClassifier(fp, format="csv")

    # Save the trained classifier to disk so later runs can skip training.
    with open('classifier.pickle', 'wb') as f:
        pickle.dump(cl, f)

    Loading:

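    import pickle
    import sys

    # Send everything printed below to a file instead of the console.
    sys.stdout = open('demo.txt', 'w')

    # Load the previously trained classifier; no retraining needed.
    # (textblob must still be installed so pickle can rebuild the object.)
    with open('classifier.pickle', 'rb') as f:
        cl = pickle.load(f)
    print(cl.classify("text to classify"))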
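
The train-once, load-afterwards pattern can also be wrapped in a single helper, so one script covers both cases. This is only a sketch: `load_or_train` and `train_stub` are hypothetical names, and the stub returns a plain dict so the example runs without TextBlob. With TextBlob, the training function would build and return the NaiveBayesClassifier instead.

```python
import os
import pickle
import tempfile


def load_or_train(path, train):
    """Return the object pickled at `path`, or call `train()` and cache its result."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            return pickle.load(fh)
    model = train()
    with open(path, "wb") as fh:
        pickle.dump(model, fh)
    return model


calls = []  # records how many times training actually ran


def train_stub():
    # Stand-in for the slow training step (hypothetical); any picklable
    # object works here, including a trained classifier.
    calls.append(1)
    return {"label_counts": {"FI-PDL": 1}}


cache_path = os.path.join(tempfile.mkdtemp(), "classifier.pickle")
first = load_or_train(cache_path, train_stub)   # trains and writes the pickle
second = load_or_train(cache_path, train_stub)  # loads from disk, no training
print(len(calls))
```

If the training CSV changes, delete the pickle file so the next run retrains. Only unpickle files you created yourself, since `pickle.load` can execute arbitrary code from a malicious file.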