Tags: scikit-learn, sklearn-pandas, multilabel-classification

Inverse transform function is not returning correct value


I am following a tutorial on multi-label movie genre classification: https://www.analyticsvidhya.com/blog/2019/04/predicting-movie-genres-nlp-multi-label-classification/

I am adapting that tutorial to predict tags for a complaint register. In my case, 'Genre' is the label for a complaint, and one complaint can have several Genre labels. For example, Complaint #1 has Genre = Warranty, Air Conditioning.

I have reached the stage where I call MultiLabelBinarizer() to encode the 'Genre' labels.

My issue is as follows:

There are 55 unique Genre values in total (see screenshot below).

I fit the MultiLabelBinarizer and transformed the "Genre" target variable into y.

Questions:

  1. y has shape (166, 49). If my understanding is correct, that means only 49 Genre labels were found, as opposed to the 55 unique ones.

  2. I get this warning: C:\Users\LAUJ3\Documents\Python Project\env\lib\site-packages\sklearn\multiclass.py:74: UserWarning: Label not 47 is present in all training examples. warnings.warn("Label %s is present in all training examples." %

  3. The result of the binarizer's inverse_transform does not make sense. I expected multilabel_binarizer.inverse_transform(y_pred)[3] to return Genre labels, but it returns gibberish:

    y_pred[3]
    Out[57]: array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])

    multilabel_binarizer.inverse_transform(y_pred)[3]
    Out[58]: (' ', ',', 'a', 'c', 'e', 'g', 'i', 'n', 'o', 'r', 't')

I don't know what went wrong. Thanks for your help in advance.

Screenshot


Solution

  • from sklearn.preprocessing import MultiLabelBinarizer
    
    mlb = MultiLabelBinarizer()
    mlb.fit_transform(df['genre'])
    
    # inspect the classes the binarizer learned
    print(mlb.classes_)
    # output:
    [' ' '"' '&' "'" ',' '-' '/' '0' '1' '2' '3' '4' '5' '6' '7' '8' '9' ':'
    'A' 'B' 'C' 'D' 'E' 'F' 'G' 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' 'P' 'Q' 'R'
    'S' 'T' 'V' 'W' 'Z' '[' '\\' ']' '_' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i'
    'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' '{'
    '}']
    

    You are getting characters as classes because each entry of df['genre'] is a string, and MultiLabelBinarizer iterates over a string one character at a time.

    # check the type of the first entry in df['genre']
    print(type(df['genre'][0]))
    # output:
    <class 'str'>
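
To see the effect in isolation, here is a minimal sketch. The sample rows are made up for illustration and are not taken from your data; the point is only that a plain string is treated as a sequence of characters, while a list of strings is treated as a set of labels.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample rows: each row is one comma-separated string, not a list.
raw = ["Warranty, Airconditioning", "Warranty"]

mlb_chars = MultiLabelBinarizer()
mlb_chars.fit(raw)
# Iterating a string yields characters, so every class is a single character.
char_classes = list(mlb_chars.classes_)
print(char_classes)

# The same rows, with each label as a separate list element.
as_lists = [["Warranty", "Airconditioning"], ["Warranty"]]
mlb_labels = MultiLabelBinarizer()
y_demo = mlb_labels.fit_transform(as_lists)
label_classes = list(mlb_labels.classes_)
print(label_classes)

# inverse_transform now returns whole labels instead of characters.
recovered = mlb_labels.inverse_transform(y_demo)
print(recovered)
```

This would also explain a column count that does not match the number of unique labels: the columns correspond to distinct characters, not distinct Genre values.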
    

    Convert each entry of the genre column from a string to a dict and pull out its values:

    # parse each dict-formatted string and keep only the dict values (the genre names)
    df['genre'] = df['genre'].apply(lambda x: [value for value in eval(x).values()])
    print(type(df['genre'][0]))
    # output:
    <class 'list'>
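
As a side note, eval will execute whatever code the string contains, so for data read from a file it may be safer to use ast.literal_eval from the standard library, which only parses Python literals. A minimal sketch, assuming the entries look like dictionaries written as strings; the keys and values below are invented for illustration:

```python
import ast

import pandas as pd

# Hypothetical rows in a dictionary-as-string format (keys are placeholders).
df_demo = pd.DataFrame(
    {"genre": ['{"id1": "Drama", "id2": "Action"}', '{"id3": "Comedy"}']}
)

# ast.literal_eval parses literals only; it does not run arbitrary code.
df_demo["genre"] = df_demo["genre"].apply(
    lambda s: list(ast.literal_eval(s).values())
)

parsed = df_demo["genre"].tolist()
print(parsed)
```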
    

    Now you can apply MultiLabelBinarizer to the df['genre'] column, and inverse_transform will work for you:

    mlb.fit_transform(df['genre'])
    print(mlb.classes_[0:10])  # show only the first 10 classes, since there are 363 in total
    
    # output:
    array(['Absurdism', 'Acid western', 'Action', 'Action Comedy',
       'Action Thrillers', 'Action/Adventure', 'Addiction Drama', 'Adult',
       'Adventure', 'Adventure Comedy'], dtype=object)
    

    Updated code for your data:

    # use this instead of: df['genre'] = df['genre'].apply(lambda x :[value for value in eval(x).values()])
    df['Genre'] = df['Genre'].apply(lambda x: x.split(','))
    mlb.fit_transform(df['Genre'])
    
    print(mlb.classes_)
    # output:
    array([' Curtain/Blinds', ' Delays', ' Electricial Compliance',
       ' Granny Flat', ' Heating/Cooling', ' Payment', ' Refund',
       ' Unlicensed', ' Warranty', 'Airconditioning', 'Heating/Cooling',
       'Warranty'], dtype=object)
    

    The earlier data stored each entry as a dictionary written as a string, but in your data each entry is a comma-separated string, so you don't need eval; a simple split will work for you.
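
One thing to watch in the output above: splitting on ',' alone keeps the space after each comma, so ' Warranty' and 'Warranty' end up as two different classes. A minimal sketch of one way to avoid that, stripping whitespace from each piece after the split. The sample rows and the split_labels helper are hypothetical, introduced only for this example:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical sample rows in the comma-separated format of the question.
df_demo = pd.DataFrame(
    {"Genre": ["Warranty, Airconditioning", "Heating/Cooling", "Delays, Warranty"]}
)


def split_labels(text):
    """Split on commas, trim surrounding whitespace, drop empty pieces."""
    return [part.strip() for part in text.split(",") if part.strip()]


df_demo["Genre"] = df_demo["Genre"].apply(split_labels)

mlb_demo = MultiLabelBinarizer()
y_clean = mlb_demo.fit_transform(df_demo["Genre"])

# Each label now appears once, regardless of where it sat in the row.
clean_classes = list(mlb_demo.classes_)
print(clean_classes)

# Round trip: the binary matrix maps back to the original label sets.
roundtrip = mlb_demo.inverse_transform(y_clean)
print(roundtrip)
```

With the labels trimmed, the number of columns in y should match the number of unique Genre values, provided the labels themselves are spelled consistently.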