python scikit-learn tf-idf feature-selection tfidfvectorizer

How to get best features for tf-idf classifiers?


I have a list of comments (text) that I need to classify with several classifiers (passed in as input). I'm using a pipeline to do this, and I use KFold cross-validation because the dataset is very small. I would like to get the names of the best features chosen by SelectKBest, but since it is a step inside the pipeline, I don't know how to retrieve them.

comments is a list of strings.

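from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline


def classify(classifiers, folder="tfidf-classifiers"):
    # get_comments, get_labels, tokenizer, saveHeatmap, information and
    # non_information are defined elsewhere (not shown here).
    comments = get_comments()
    labels = get_labels()

    tfidf_vector = TfidfVectorizer(tokenizer=tokenizer, lowercase=False)
    stats = {}
    for i in classifiers:
        classifier = i()
        pipe = Pipeline(
            [('vectorizer', tfidf_vector), ('feature_selection', SelectKBest(chi2)), ('classifier', classifier)])

        # Out-of-fold predictions: each comment is predicted by a pipeline that never saw it.
        result = cross_val_predict(pipe, comments, labels, cv=KFold(n_splits=10, shuffle=True))

        # confusion_matrix expects (y_true, y_pred); the class order is passed by keyword.
        cm = confusion_matrix(labels, result, labels=[information, non_information])
        saveHeatmap(cm, i.__name__, folder)

        report = classification_report(labels, result, digits=3, target_names=['no', 'yes'], output_dict=True)

        stats[i.__name__] = report
    return stats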

I searched online and found this:

 pipe.named_steps['feature_selection'].get_support()

But I can't do this, since I'm not calling fit on the pipeline. I use the pipeline here:

 result = cross_val_predict(pipe, comments, labels, cv=KFold(n_splits=10, shuffle=True))

How can I get the best K feature names?

What I want is a simple list of the words that "helped the most" when the classifiers did their job.


Solution

  • From "NLP in Python: Obtain word names from SelectKBest after vectorizing":

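    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import chi2

    # df is assumed to be a pandas DataFrame with a text column "Notes"
    # and a label column "AboveAverage"
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(df["Notes"])

    # chi2 returns (scores, p-values); keep only the scores
    chi2score = chi2(X, df["AboveAverage"])[0]

    # Pair each word with its score and sort ascending by score
    # (get_feature_names() was removed in scikit-learn 1.2)
    wscores = zip(vectorizer.get_feature_names_out(), chi2score)
    wchi2 = sorted(wscores, key=lambda x: x[1])

    # Take the 20 highest-scoring words and split them into (words, scores)
    topchi2 = zip(*wchi2[-20:])
    show = list(topchi2)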
    

    You can easily change the scoring function by swapping chi2 for f_classif or another function from sklearn.feature_selection.