python-3.x machine-learning scikit-learn naivebayes

Informative Features Code not Working

I want to implement a 'most informative features' function for a binary Naive Bayes classifier in scikit-learn. I am using Python 3.

First off, I understand that the question of implementing some sort of 'informative features' function for scikit-learn's multinomial NB has been asked before. However, I have tried the responses and have had no luck, so I think either scikit-learn has been updated or I am doing something very wrong. I am using tobigue's answer here for the function.

from nltk.corpus import stopwords
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split



#Array contains a list of (headline, source) tuples where there are two sources. 
#I want to classify each headline as belonging to a given source. 
array = [('toyota showcases humanoid that mirrors user', 'drudge'), ('virginia again delays vote certification after error in ballot distribution', 'npr'), ("do doctors need to use computers? one physician's case highlights the quandary", 'npr'), ('office sex summons', 'drudge'), ('launch calibrated to avoid military response?', 'drudge'), ('snl skewers alum al franken, trump sons', 'npr'), ('mulvaney shows up for work at consumer watchdog group, as leadership feud deepens', 'npr'), ('indonesia tries to evacuate 100,000 people away from erupting volcano on bali', 'npr'), ('downing street blasts', 'drudge'), ('stocks soar more; records smashed', 'drudge'), ('aid begins to filter back into yemen, as saudi-led blockade eases', 'npr'), ('just look at these fancy port-a-potties', 'npr'), ('nyt turns to twitter activism to thwart', 'drudge'), ('uncertainty reigns in battle for virginia house of delegates', 'npr'), ('u.s. reverses its decision to close palestinian office in d.c.', 'npr'), ("'i don't believe in science,' says flat-earther set to launch himself in own rocket", 'npr'), ("bosnian war chief 'dies' after being filmed 'drinking poison' at the hague", 'drudge'), ('federal judge blocks new texas anti-abortion law', 'npr'), ('gm unveils driverless cars, aiming to lead pack', 'drudge'), ('in japan, a growing scandal over companies faking product-quality data', 'npr')]


#Fit a CountVectorizer -> TF-IDF -> MultinomialNB pipeline on the headlines. 
def scikit_naivebayes(data_array):
    headlines = [element[0] for element in data_array]
    sources = [element[1] for element in data_array]
    text_clf = Pipeline([('vect', CountVectorizer(stop_words='english')), ('tfidf', TfidfTransformer()),('clf', MultinomialNB())])
    cf1 = text_clf.fit(headlines, sources)
    train(cf1,headlines,sources)

    #Call most_informative_features function on CountVectorizer and classifier
    show_most_informative_features(CountVectorizer, cf1)


def train(classifier, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=33)
    classifier.fit(X_train, y_train)
    print ("Accuracy: {}".format(classifier.score(X_test, y_test)))


#tobigue's code: 
def show_most_informative_features(vectorizer, clf, n=20):
    feature_names = vectorizer.get_feature_names()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    top = zip(coefs_with_fns[:n], coefs_with_fns[:-(n + 1):-1])
    for (coef_1, fn_1), (coef_2, fn_2) in top:
        print ("\t%.4f\t%-15s\t\t%.4f\t%-15s" % (coef_1, fn_1, coef_2, fn_2))


def main():
    scikit_naivebayes(array)


main()

#ERROR: 
# File "file_path_here", line 34, in program_name
# feature_names = vectorizer.get_feature_names()
# TypeError: get_feature_names() missing 1 required positional argument: 'self'

Solution

You need to pass a fitted CountVectorizer instance to the function before it calls vectorizer.get_feature_names(). In your code, you pass the class CountVectorizer itself, so the method is called without an instance and Python raises the TypeError about the missing 'self' argument.

Try creating a vectorizer with CountVectorizer independently of your pipeline, call fit on your text, and then use the function you already have, though you will need to adapt it further to your problem.

The key point is that the function you use needs an instantiated object, not a class. Tell me if that is unclear.
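To make the class-versus-instance point concrete, here is a minimal sketch in plain Python. ToyVectorizer is a hypothetical stand-in, not a scikit-learn class; it only mimics the shape of the method call that fails in the question.

```python
class ToyVectorizer:
    """Hypothetical stand-in for a vectorizer; not a scikit-learn class."""

    def fit(self, documents):
        # Learn a sorted vocabulary from whitespace-separated tokens.
        self.vocabulary_ = sorted({tok for doc in documents for tok in doc.split()})
        return self

    def get_feature_names(self):
        return self.vocabulary_


def describe(vectorizer):
    """Return the feature names, or the error message if the call fails."""
    try:
        return vectorizer.get_feature_names()
    except TypeError as exc:
        return "TypeError: {}".format(exc)


# Passing the class itself: there is no instance to bind to `self`.
class_result = describe(ToyVectorizer)

# Passing a fitted instance: the method has a `self` and a vocabulary to return.
instance_result = describe(ToyVectorizer().fit(["stocks soar", "office sex summons"]))

print(class_result)
print(instance_result)
```

The first call reproduces the kind of TypeError shown in the question; the second returns the learned vocabulary because the method is bound to an object that has been fitted.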

Edit

coef_ is an attribute available only on a fitted estimator, i.e. a classifier (and not on all of them). Pipeline is a scikit-learn object used to combine several steps that feed a classifier. Typically, a bag-of-words pipeline consists of a feature extractor and a classifier (here, logistic regression):

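from sklearn.linear_model import LogisticRegression

pipeline = Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),  # any CountVectorizer arguments go here
    ('classifier', LogisticRegression()),                   # the last step is the estimator
])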

So, in your case, you should either avoid using a pipeline (which is what I recommend to begin with), or use the pipeline's get_params() method to access the classifier.
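As a sketch of the second option, the fitted steps of a pipeline can be pulled out by name. This assumes the step names from the question ('vect', 'tfidf', 'clf') and uses a four-headline subset of its data purely for illustration; named_steps is shown here as an alternative to get_params(), and the variable names are my own.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative sample drawn from the question's data.
headlines = [
    'office sex summons',
    'stocks soar more; records smashed',
    'federal judge blocks new texas anti-abortion law',
    'uncertainty reigns in battle for virginia house of delegates',
]
sources = ['drudge', 'drudge', 'npr', 'npr']

text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(headlines, sources)

# After fitting, each step is the fitted object and can be looked up by name.
fitted_vectorizer = text_clf.named_steps['vect']
fitted_classifier = text_clf.named_steps['clf']

# get_params() exposes the same step objects under the same names.
same_object = text_clf.get_params()['clf'] is fitted_classifier

print(len(fitted_vectorizer.vocabulary_), fitted_classifier.classes_, same_object)
```

These two fitted objects, rather than the CountVectorizer class and the whole pipeline, are what a function such as show_most_informative_features expects to receive.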

I suggest you fit_transform the text, feed the transformed result to a logistic regression or naive Bayes classifier, and then call the function you have:

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines, sources)
naive_bayes = MultinomialNB()
naive_bayes.fit(X, sources)
show_most_informative_features(vectorizer, naive_bayes)

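First try that; once it works, you will better understand how to use a pipeline. Note that in a Pipeline every step except the last must be a transformer, and the last step should be an estimator. Your chain of CountVectorizer, TfidfTransformer and MultinomialNB does follow that rule, so the pipeline itself fits; the error comes from the arguments passed to show_most_informative_features. If you want to combine two feature extractors side by side rather than in sequence, look at FeatureUnion.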
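One caveat on library versions, stated as my understanding rather than something the question confirms: in recent scikit-learn releases get_feature_names() has been replaced by get_feature_names_out(), and MultinomialNB no longer exposes coef_. The sketch below is an adaptation, not tobigue's original function. It falls back between the two vectorizer methods and ranks features by the difference of the per-class log probabilities in feature_log_prob_. The function name most_informative_features and the ranking rule are my own choices and assume exactly two classes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def most_informative_features(vectorizer, clf, n=5):
    """Return (features leaning to classes_[0], features leaning to classes_[1])."""
    # Newer releases provide get_feature_names_out(); older ones get_feature_names().
    if hasattr(vectorizer, 'get_feature_names_out'):
        names = np.asarray(vectorizer.get_feature_names_out())
    else:
        names = np.asarray(vectorizer.get_feature_names())
    # feature_log_prob_ has shape (n_classes, n_features); two classes assumed.
    log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
    order = np.argsort(log_ratio)
    toward_first = [(str(names[i]), float(log_ratio[i])) for i in order[:n]]
    toward_second = [(str(names[i]), float(log_ratio[i])) for i in order[::-1][:n]]
    return toward_first, toward_second


# Tiny illustrative sample drawn from the question's data.
headlines = [
    'office sex summons',
    'stocks soar more; records smashed',
    'federal judge blocks new texas anti-abortion law',
    'uncertainty reigns in battle for virginia house of delegates',
]
sources = ['drudge', 'drudge', 'npr', 'npr']

vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(headlines)
naive_bayes = MultinomialNB().fit(X, sources)

toward_first, toward_second = most_informative_features(vectorizer, naive_bayes, n=3)
print(naive_bayes.classes_[0], toward_first)
print(naive_bayes.classes_[1], toward_second)
```

A negative log ratio means the word is more probable under the first class in classes_, a positive one under the second. With so few headlines many words tie, so the exact ordering within each list is not meaningful.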