python scikit-learn nlp text-classification

How to have multioutput in text classification?


I'm doing dialect text classification. The problem is that some tweets can be classified as both dialect A and dialect B. How can I do that? I want the accuracy to be calculated automatically; I don't want to do it manually. When I don't allow a tweet to be classified as both A and B, many texts end up misclassified.

In the training data, though, no tweet is labeled as both dialect A and B; each tweet has a single label.


Solution

  • Make use of OneHotEncoding

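    import numpy as np
    from sklearn.preprocessing import OneHotEncoder
    
    # Your target will look similar to this (one column, one row per tweet)
    target = np.array(['A', 'A', 'B']).reshape(-1, 1)
    
    # After one-hot encoding: one column per dialect, in sorted order (A, B)
    encoded = OneHotEncoder().fit_transform(target).toarray()
    # [[1., 0.],
    #  [1., 0.],
    #  [0., 1.]]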
    

    After training on this target, your model will predict a probability for each class. You can set a threshold and assign a text to both classes when each probability reaches it.

    # Sample output
    [[1., 0.],
     [0.5, 0.5],
     [0.1, 0.9]]
    
    predictions = ['A', 'A and B', 'B']
    

    Example
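A minimal sketch of the thresholding step, plus one way to score it automatically. The helper names `labels_from_proba` and `exact_match_accuracy`, the threshold of 0.4, and the `gold` annotations are assumptions made for illustration; they are not part of scikit-learn. With a scikit-learn classifier, the probability rows would typically come from `predict_proba` and the column order from `classes_`.

```python
def labels_from_proba(proba, classes, threshold=0.4):
    """Turn each row of class probabilities into a set of predicted labels."""
    results = []
    for row in proba:
        # Keep every class whose probability reaches the threshold.
        chosen = {c for c, p in zip(classes, row) if p >= threshold}
        if not chosen:
            # No class reached the threshold: fall back to the most likely one.
            best = max(range(len(row)), key=lambda i: row[i])
            chosen = {classes[best]}
        results.append(chosen)
    return results


def exact_match_accuracy(gold, predicted):
    """Fraction of texts whose predicted label set equals the gold label set."""
    matches = sum(1 for g, p in zip(gold, predicted) if set(g) == set(p))
    return matches / len(gold)


# Probabilities as in the sample output above. With scikit-learn these would
# come from model.predict_proba(texts), with columns ordered as model.classes_.
classes = ["A", "B"]
proba = [[1.0, 0.0], [0.5, 0.5], [0.1, 0.9]]

predicted = labels_from_proba(proba, classes, threshold=0.4)

# Hypothetical gold annotations for the test tweets; a tweet may carry both dialects.
gold = [{"A"}, {"A", "B"}, {"B"}]
accuracy = exact_match_accuracy(gold, predicted)
print(predicted, accuracy)
```

Exact-match scoring is strict: a tweet annotated as both A and B counts as correct only if both labels are predicted. If a partial match should count, the comparison inside `exact_match_accuracy` could be replaced with a set-overlap check instead.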