python, pandas, inverse, one-hot-encoding

Invert a binarized DataFrame back to the original categorical values after unpickling


I am trying to solve a classification problem where the label column contains string values.

Steps followed in training the model:

  1. Converted the DataFrame to binarized values using pandas.get_dummies.

  2. Trained a RandomForestClassifier (scikit-learn) model.

  3. Pickled the model.

Testing the model:

  1. Unpickled the model.

  2. Passed the test data and got the result from the Random Forest classifier.

  3. The output is in binarized format.

Objective:

I would like to convert this output back to its original string values.

Please suggest a solution, if there is one.

Note: Most of the threads on the internet only take me as far as the result from the classifier, or do the training and testing in a single program.


Solution

  • Aside from your problem, consider using joblib instead of pickle, because it is usually more efficient for storing models that hold large NumPy arrays, such as a Random Forest. Now, for your problem, there are some things to consider:

    Whether or not you pickle, the output of your pipeline is the same. Pickling only stores the model; once your random forest is unpickled, it has the same properties and behaviour as before. You may have misunderstood the input format, or how to apply the prediction method. Let's take an example: a DataFrame with 3 categorical variables and a class that depends on those 3 features.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    # Assumes example.csv has a header row containing these column names
    df = pd.read_csv('example.csv', usecols=['col1', 'col2', 'col3', 'class'])
    

    Now apply one-hot encoding to the features and fit a Random Forest on the "class" column, which is left as strings:

    # Turn the feature columns into dummies
    dummies = pd.get_dummies(df[['col1', 'col2', 'col3']])
    
    # Random forest
    clf = RandomForestClassifier()
    # "class" is a Python keyword, so use df['class'] rather than df.class
    model = clf.fit(dummies, df['class'])
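
To illustrate why the label column is left as strings: a classifier fitted on string labels returns strings from predict, whereas one fitted on a get_dummies-encoded label returns a 0/1 indicator matrix. The sketch below uses a small made-up DataFrame (all column names and values are hypothetical) and shows one possible way to map such a matrix back to strings, by keeping the dummy column names and taking idxmax along each row. It is an illustration under those assumptions, not necessarily what your original code does.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'b', 'a', 'b'],
    'col2': ['x', 'x', 'y', 'y', 'x', 'y'],
    'class': ['cat', 'dog', 'cat', 'dog', 'cat', 'dog'],
})
X = pd.get_dummies(df[['col1', 'col2']])

# Case 1: string labels passed directly; predict returns strings
clf_str = RandomForestClassifier(n_estimators=10, random_state=0)
clf_str.fit(X, df['class'])
pred_str = clf_str.predict(X)

# Case 2: label one-hot encoded; predict returns an indicator matrix
y_dummies = pd.get_dummies(df['class']).astype(int)
clf_bin = RandomForestClassifier(n_estimators=10, random_state=0)
clf_bin.fit(X, y_dummies)
pred_bin = clf_bin.predict(X)

# Map the indicator matrix back using the saved dummy column names
label_columns = list(y_dummies.columns)
decoded = pd.DataFrame(pred_bin, columns=label_columns).idxmax(axis=1)
```

Note that in the second case each label column is predicted as a separate output, so a row may contain no 1 or several; idxmax then simply picks the first maximal column. Passing the string labels directly avoids that ambiguity. If you do encode the label, label_columns would need to be saved alongside the model.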
    

    Dumping and loading the model with joblib:

    # sklearn.externals.joblib was removed from recent scikit-learn; import joblib directly
    import joblib
    #Dumping
    joblib.dump(clf, 'filename.pkl') 
    
    #Loading
    clf = joblib.load('filename.pkl')
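
As a quick check that persistence does not change the model, the following sketch dumps and reloads a classifier and compares the two. The toy data and the temporary file path are hypothetical, chosen only to keep the example self-contained.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data, for illustration only
train = pd.DataFrame({
    'col1': ['a', 'b', 'a', 'b'],
    'class': ['cat', 'dog', 'cat', 'dog'],
})
X = pd.get_dummies(train[['col1']])
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, train['class'])

# Round trip through a temporary file
path = os.path.join(tempfile.mkdtemp(), 'model.pkl')
joblib.dump(clf, path)
restored = joblib.load(path)

# The reloaded model keeps its string class labels and its predictions
original_classes = list(clf.classes_)
restored_classes = list(restored.classes_)
same_predictions = bool((clf.predict(X) == restored.predict(X)).all())
```

The classes_ attribute holds the original string labels, so the reloaded model can still return them from predict.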
    

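    Or with pickle, if you prefer to stick with it:

    # cPickle exists only in Python 2; in Python 3 use pickle
    import pickle
    
    # Dumping
    with open('path/to/file', 'wb') as f:
        pickle.dump(clf, f)
    
    # Loading (load from the file object, not from clf)
    with open('path/to/file', 'rb') as f:
        clf = pickle.load(f)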
    

    Now that you have reloaded your model, the proper way to obtain a result is to call its predict method on new data. Suppose you have a second DataFrame in the same format, except that the class column is missing. You would do it the following way:

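    # Assumes test.csv has a header row containing these column names
    df_test = pd.read_csv("test.csv", usecols=['col1', 'col2', 'col3'])
    
    # Create dummies for the test features
    dummies_test = pd.get_dummies(df_test)
    
    # Get the prediction; the labels come back as the original strings
    df_test['predicted'] = clf.predict(dummies_test)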