Search code examples
pythonnltknaivebayestextblob

After training my own classifier with nltk, how do I load it in textblob?


The built-in classifier in textblob is pretty dumb. It's trained on movie reviews, so I created a huge set of examples in my context (57,000 stories, categorized as positive or negative) and then trained it using nltk. I tried using textblob to train it but it always failed:

with open('train.json', 'r') as fp:
    cl = NaiveBayesClassifier(fp, format="json")

That would run for hours and end in a memory error.

I looked at the source and found it was just using nltk and wrapping that, so I used that instead, and it worked.

The structure for nltk training set needed to be a list of tuples, with the first part was a Counter of words in the text and frequency of appearance. The second part of tuple was 'pos' or 'neg' for sentiment.

>>> train_set = [(Counter(i["text"].split()),i["label"]) for i in data[200:]]
>>> test_set = [(Counter(i["text"].split()),i["label"]) for i in data[:200]] # withholding 200 examples for testing later

>>> cl = nltk.NaiveBayesClassifier.train(train_set) # <-- this is the same thing textblob was using

>>> print("Classifier accuracy percent:",(nltk.classify.accuracy(cl, test_set))*100)
('Classifier accuracy percent:', 66.5)
>>>>cl.show_most_informative_features(75)

Then I pickled it.

with open('storybayes.pickle','wb') as f:
    pickle.dump(cl,f)

Now... I took this pickled file, and re opened it to get the nltk.classifier 'nltk.classify.naivebayes.NaiveBayesClassifier'> -- and tried to feed it into textblob. Instead of

from textblob.classifiers import NaiveBayesClassifier
blob = TextBlob("I love this library", analyzer=NaiveBayesAnalyzer())

I tried:

blob = TextBlob("I love this library", analyzer=myclassifier)
Traceback (most recent call last):
  File "<pyshell#116>", line 1, in <module>
    blob = TextBlob("I love this library", analyzer=cl4)
  File "C:\python\lib\site-packages\textblob\blob.py", line 369, in __init__
    parser, classifier)
  File "C:\python\lib\site-packages\textblob\blob.py", line 323, in 
_initialize_models
    BaseSentimentAnalyzer, BaseBlob.analyzer)
  File "C:\python\lib\site-packages\textblob\blob.py", line 305, in 
_validated_param
    .format(name=name, cls=base_class_name))
ValueError: analyzer must be an instance of BaseSentimentAnalyzer

what now? I looked at the source and both are classes, but not quite exactly the same.


Solution

  • Another more forward-looking solution is to use spaCy to build the model instead of textblob or nltk. This is new to me, but seems a lot easier to use and more powerful: https://spacy.io/usage/spacy-101#section-lightning-tour

    "spaCy is the Ruby of Rails of natural language processing."

    import spacy
    import random
    
    nlp = spacy.load('en') # loads the trained starter model here
    train_data = [("Uber blew through $1 million", {'entities': [(0, 4, 'ORG')]})] # better model stuff
    
    with nlp.disable_pipes(*[pipe for pipe in nlp.pipe_names if pipe != 'ner']):
        optimizer = nlp.begin_training()
        for i in range(10):
            random.shuffle(train_data)
            for text, annotations in train_data:
                nlp.update([text], [annotations], sgd=optimizer)
    nlp.to_disk('/model')