
A question on text classification with more than one level of category


I am trying to build a series of product classifiers based on each product's text description. My data frame is similar to the following, but more complicated. I am using Python and the sklearn library.

import pandas as pd

data = {'description': ['orange', 'apple', 'bean', 'carrot', 'pork', 'fish', 'beef'],
        'Level1': ['plant', 'plant', 'plant', 'plant', 'animal', 'animal', 'animal'],
        'Level2': ['fruit', 'fruit', 'vegetable', 'vegetable', 'livestock', 'seafood', 'livestock']}

# Create DataFrame
df = pd.DataFrame(data)

"description" is the text data. Here it is a single word, but in the real data it is a longer sentence. "Level1" is the top-level category and "Level2" is a sub-category.

I know how to train a classification model to classify the products into Level 1 categories by using the sklearn library.

Below is what I did:

import pandas as pd
import numpy as np
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score
import pickle

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(df['description'],
                                                 df[['Level1','Level2']], test_size = 0.4, shuffle=True)

#use the TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(use_idf=True)

#transforming the training data into tf-idf matrix
X_train_vectors_tfidf = tfidf_vectorizer.fit_transform(X_train)

#transforming testing data into tf-idf matrix
X_test_vectors_tfidf = tfidf_vectorizer.transform(X_test)

#Create and save model for level 1
naive_bayes_classifier = MultinomialNB()
model_level1 = naive_bayes_classifier.fit(X_train_vectors_tfidf, y_train['Level1'])
with open('model_level_1.pkl','wb') as f:
    pickle.dump(model_level1, f)

What I don't know how to do is build a classification model for each Level 1 category that predicts the products' Level 2 category. For example, based on the above dataset, there should be one classification model for 'plant' (to predict fruit or vegetable) and another model for 'animal' (to predict seafood or livestock). How can I do this, and save the models, using a loop?


Solution

  • Assuming you can get all the columns of the dataset, they will be a mix of feature columns and level columns, with the levels being the class labels. Along those lines:

    cols = ["abc", "Level1", "Level2", "Level3"]
    

    From these, let's take only the level columns, because those are what we are interested in.

    level_cols = [val for val in cols if "Lev" in val]
    

    The above simply keeps the column names that contain the three characters "Lev".

    Now, with the level columns in place, I think you could do the following as a starting point:

    1. Iterate over the level columns only.
    2. Take the number at the end of each column name: 1, 2, 3, 4, ..., n.
    3. If that number is divisible by 2 (that is, for the even levels), predict using the model saved for the previous level.
    4. Otherwise, train a model on that level and save it.
    
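    for level in level_cols:
        level_idx = int(level[-1])
        if level_idx % 2 == 0:
            # Even level: open the model saved at level_idx - 1 and predict with it
            with open("model-x-" + str(level_idx - 1), "rb") as mf:
                model = pickle.load(mf)
            predictions = model.predict(x_test)  # x_test: the vectorized test data
        else:
            # Odd level: train on this level's labels and save the model
            model = naive_bayes_classifier.fit(x_train, y_train[level])
            with open("model-x-" + str(level_idx), "wb") as mf:
                pickle.dump(model, mf)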
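
For the per-category models the question describes (one Level 2 model for each Level 1 category), one possible starting point is to group the rows by Level1 and fit a separate classifier on each group. This is only a sketch under some assumptions, not the method from the answer above: it bundles the vectorizer and the classifier with sklearn's make_pipeline so each pickle file is self-contained, and the models directory, the file names, and the helper predict_hierarchy are made up for illustration.

```python
import pickle
from pathlib import Path

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data from the question
df = pd.DataFrame({
    'description': ['orange', 'apple', 'bean', 'carrot', 'pork', 'fish', 'beef'],
    'Level1': ['plant', 'plant', 'plant', 'plant', 'animal', 'animal', 'animal'],
    'Level2': ['fruit', 'fruit', 'vegetable', 'vegetable',
               'livestock', 'seafood', 'livestock'],
})

MODEL_DIR = Path('models')  # hypothetical output directory
MODEL_DIR.mkdir(exist_ok=True)

# Level 1 model: vectorizer and classifier stored together in one pipeline
level1_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
level1_model.fit(df['description'], df['Level1'])
with open(MODEL_DIR / 'model_level_1.pkl', 'wb') as f:
    pickle.dump(level1_model, f)

# One Level 2 model per Level 1 category, trained only on that category's rows
level2_models = {}
for parent, group in df.groupby('Level1'):
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(group['description'], group['Level2'])
    level2_models[parent] = model
    with open(MODEL_DIR / f'model_level_2_{parent}.pkl', 'wb') as f:
        pickle.dump(model, f)


def predict_hierarchy(texts):
    """Predict Level1 first, then Level2 with the model for that Level1 class."""
    results = []
    for text in texts:
        parent = str(level1_model.predict([text])[0])
        child = str(level2_models[parent].predict([text])[0])
        results.append((parent, child))
    return results
```

Calling predict_hierarchy(['orange', 'fish']) returns one (Level1, Level2) pair per description. Note that a mistake at Level 1 carries over to Level 2, since the Level 2 model is chosen from the Level 1 prediction.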