python, scikit-learn, text-classification, tfidfvectorizer

Group features of TF-IDF vector in scikit-learn


I'm using scikit-learn to train a text classification model on TF-IDF feature vectors with the following piece of code:

from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer

model = naive_bayes.MultinomialNB()
feature_vector_train = TfidfVectorizer().fit_transform(X)
model.fit(feature_vector_train, Y)

I need to rank the extracted features in decreasing order of their TF-IDF weight, group them into two non-overlapping sets of features, and finally train two different classification models. How can I split the main feature vector into an odd-ranked set and an even-ranked set?


Solution

  • The result of your TfidfVectorizer is an n x m matrix, where n is the number of documents and m is the number of unique words. Thus, each column in feature_vector_train corresponds to a specific word from your dataset. Adapting a solution from this tutorial should allow you to extract the highest and lowest weighted words:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer()
    feature_vector_train = vectorizer.fit_transform(X)
    feature_names = vectorizer.get_feature_names() #on scikit-learn >= 1.0 use get_feature_names_out()
    
    total_tfidf_weights = np.asarray(feature_vector_train.sum(axis=0)).ravel() #this assumes you only want a straight sum of each feature's weight across all documents; ravel flattens the 1 x m result into a plain array
    #alternatively, you could use vectorizer.transform(feature_names) to get the values of each feature in isolation
    
    #sort the feature names and the tfidf weights together by zipping them
    sorted_names_weights = sorted(zip(feature_names, total_tfidf_weights), key=lambda x: x[1], reverse=True) #the key argument tells sorted to sort by the weight; reverse=True sorts from largest to smallest
    #unzip the names and weights
    sorted_feature_names, sorted_total_tfidf_weights = zip(*sorted_names_weights)
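
    As a concrete example of the split the question asks for, here is a minimal sketch (using the sorted_feature_names tuple from above; the odd/even naming follows the question) that takes every other ranked feature:

    group1 = list(sorted_feature_names[0::2]) #features ranked 1st, 3rd, 5th, ... (the odd-ranked set)
    group2 = list(sorted_feature_names[1::2]) #features ranked 2nd, 4th, 6th, ... (the even-ranked set)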
    

    From this point you should be able to separate the features as you'd like. Once you have them in two groups, group1 and group2 (for example the odd-ranked and even-ranked lists sketched above), you can separate them into two sub-matrices like this:

    #create a feature_name to column index mapping
    column_mapping = {name: i for i, name in enumerate(feature_names)}
    
    #get the submatrices
    group1_column_indexes = [column_mapping[feat] for feat in group1]
    group1_feature_vector_train  = feature_vector_train[:,group1_column_indexes] #all rows, but only group1 columns
    
    group2_column_indexes = [column_mapping[feat] for feat in group2]
    group2_feature_vector_train  = feature_vector_train[:,group2_column_indexes]
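
    Finally, a minimal sketch of the last step the question describes: training a separate classifier on each sub-matrix (assuming Y holds the training labels and MultinomialNB as the model the question already uses):

    from sklearn import naive_bayes

    model_group1 = naive_bayes.MultinomialNB()
    model_group1.fit(group1_feature_vector_train, Y) #classifier trained on the odd-ranked features

    model_group2 = naive_bayes.MultinomialNB()
    model_group2.fit(group2_feature_vector_train, Y) #classifier trained on the even-ranked features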