Tags: python, text-classification, multiclass-classification

Multiclass Text Classification in Python


I am trying to build a multiclass text classifier as explained here. However, my code breaks at this line:

NB_pipeline.fit(X_train, train[category])

This is the error I get:

File "pandas\hashtable.pyx", line 683, in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12322)

I tried to inspect what train[category] returns on its own and got the same error.

1) X_train is a dataframe with one column containing customer feedback.

2) train is a dataframe with two columns: the first contains the customer review (same as X_train) and the second contains one of the 5 categories (Systems Error, Proactive Communication, Staff Behaviour, Website Functionalities, Others).

3) category is one of the above-mentioned categories.

Below is the sample train dataframe:

Index           Feedback                                    Category
  0           While making payment got system error.         System error
              Staff behaviour was good at hotel

  1           While making payment got system error.         Staff Behaviour
              Staff behaviour was good at hotel

Solution

  • This is one of the most commonly overlooked issues.

    The error occurs because the column the script is looking for does not exist in the dataframe: train[category] looks up a column named after the category, but your dataframe only has the Feedback and Category columns. Each of your 5 categories should be its own column in the input dataframe, and each row should hold 1 or 0 depending on whether that category applies to the feedback/comment. Ideally, your input dataframe should look like this:

    Index           Feedback                                  System error    Staff Behaviour
      0           While making payment got system error.         1                  1
                  Staff behaviour was good at hotel
    
      1           While making payment got system error.         1                  0
    
      2           Staff behaviour was good at hotel              0                  1
    

    I have reused the same comments from your sample to show what the input dataframe should look like.
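
If the data is currently in the two-column layout from the question, one way to reach the layout above is a one-hot encoding followed by a group-by. This is only a minimal sketch, assuming pandas is available; the sample rows and the names train_long, indicators and categories are made up for illustration and are not from the original post.

```python
import pandas as pd

# Hypothetical long-format data mirroring the question: one row per
# (feedback, category) pair, so a comment with two labels appears twice.
combined = "While making payment got system error. Staff behaviour was good at hotel"
train_long = pd.DataFrame({
    "Feedback": [
        combined,
        combined,
        "While making payment got system error.",
        "Staff behaviour was good at hotel",
    ],
    "Category": [
        "System error",
        "Staff Behaviour",
        "System error",
        "Staff Behaviour",
    ],
})

# Build one 0/1 indicator column per category.
indicators = pd.get_dummies(train_long["Category"]).astype(int)

# Collapse duplicate comments: taking the max keeps a 1 for every
# category that any copy of the comment was tagged with.
train = (
    pd.concat([train_long[["Feedback"]], indicators], axis=1)
    .groupby("Feedback", sort=False, as_index=False)
    .max()
)

# Each category name is now a real column, so train[category] resolves.
categories = list(indicators.columns)
for category in categories:
    print(category, train[category].tolist())
```

With the dataframe in this shape, train[category] returns a 0/1 column for each category name, which should be usable as the target in a call like NB_pipeline.fit(X_train, train[category]), provided X_train is built from the same deduplicated rows.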