I am trying to solve a classification problem where the label column contains string values.
Converted the dataframe to binarized values using pandas.get_dummies.
Trained a RandomForestClassifier (scikit-learn) model
Pickled the model
Unpickled the model
Passed the test data and got the result from the Random Forest classifier
The output is in binarized format
I would like to convert this output back to its original string values.
Please suggest if there is a solution.
Note: Most of the threads on the internet only take me as far as getting the result from the classifier, or do the training and testing in a single program.
As an aside, consider using joblib instead of pickle: it is generally more efficient for storing models that hold large NumPy arrays, such as a Random Forest. As for your actual problem, there are a few things to consider:
Pickled or not, the output of your model is the same. Pickling is only a way to store your model; once your random forest is unpickled, it has the same properties and characteristics as before. It may be that you have misunderstood your input format, or that you do not know how to apply the predict method. Let's take an example: a DataFrame with 3 categorical variables and a class that depends on these 3 features.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
#Assumes example.csv has a header row containing these column names
df = pd.read_csv('example.csv', usecols=['col1', 'col2', 'col3', 'class'])
Now apply one-hot encoding to the features and fit a Random Forest to the "class" column:
#Turning it into dummies
dummies = pd.get_dummies(df[['col1', 'col2', 'col3']])
#Random forest
clf = RandomForestClassifier()
#Bracket access is required because "class" is a reserved word in Python
clf.fit(dummies, df['class'])
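Note that scikit-learn classifiers accept string labels directly, so only the features need to be one-hot encoded. Here is a minimal sketch of that point; the toy DataFrame and its values are made up for illustration and stand in for example.csv:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data standing in for example.csv
df = pd.DataFrame({
    "col1": ["a", "b", "a", "b"],
    "col2": ["x", "x", "y", "y"],
    "col3": ["m", "n", "m", "n"],
    "class": ["cat", "dog", "cat", "dog"],
})

# One-hot encode the features only; the label column stays as strings
X = pd.get_dummies(df[["col1", "col2", "col3"]], dtype=int)

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X, df["class"])

# The fitted model remembers the original string labels...
print(clf.classes_)
# ...and predict() returns them, so no inverse transform is needed
print(clf.predict(X))
```

If the label column is left as strings like this, the binarized-output problem does not arise in the first place.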
Dumping and loading the model with joblib:
#sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly
import joblib
#Dumping
joblib.dump(clf, 'filename.pkl')
#Loading
clf = joblib.load('filename.pkl')
Or with pickle, if you prefer to stick with it:
#cPickle only exists in Python 2; in Python 3 use pickle
import pickle
#Dumping
with open('path/to/file', 'wb') as f:
    pickle.dump(clf, f)
#Loading
with open('path/to/file', 'rb') as f:
    clf = pickle.load(f)
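Since you train and predict in separate programs, it may help to store the training dummy columns together with the model. The following is only a sketch under that assumption: the bundle dictionary, its keys, the file name and the toy data are all hypothetical, not something joblib or scikit-learn prescribes.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy training data
train = pd.DataFrame({
    "col1": ["a", "b", "a", "b"],
    "class": ["cat", "dog", "cat", "dog"],
})
X = pd.get_dummies(train[["col1"]], dtype=int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, train["class"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    # "Training program": store the model and its feature columns together
    joblib.dump({"model": clf, "columns": list(X.columns)}, path)
    # "Prediction program": load both back
    bundle = joblib.load(path)

restored = bundle["model"]

# The reloaded model predicts exactly like the original one
same = np.array_equal(clf.predict(X), restored.predict(X))

# New data may lack some categories; reindex restores the training layout
new = pd.DataFrame({"col1": ["a"]})
X_new = pd.get_dummies(new, dtype=int).reindex(columns=bundle["columns"], fill_value=0)
print(same, list(X_new.columns), restored.predict(X_new))
```

The reindex step matters because get_dummies only creates columns for the categories it actually sees, so a small test file can otherwise produce fewer columns than the model was trained on.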
Now that you have reloaded your model, the proper way to obtain a result is to call the predict method on new data. Suppose you have a second DataFrame in the same format, except that the class column is missing. You would do it the following way:
df_test = pd.read_csv('test.csv', usecols=['col1', 'col2', 'col3'])
#Creating dummies
dummies_test = pd.get_dummies(df_test)
#Aligning with the training columns (a category absent from the test set would otherwise leave a column missing)
dummies_test = dummies_test.reindex(columns=dummies.columns, fill_value=0)
#Getting the prediction
df_test['predicted'] = clf.predict(dummies_test)
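If the model was instead trained on a one-hot encoded label (that is, get_dummies was applied to the class column as well), predict returns a 0/1 matrix with one column per label. One possible way to get the strings back is to take the name of the column holding the 1 in each row. The helper decode_one_hot and the toy data below are hypothetical and only illustrate the idea:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def decode_one_hot(pred, label_columns):
    """Return, for each row of a 0/1 matrix, the name of the column with the largest value."""
    return pd.DataFrame(pred, columns=label_columns).idxmax(axis=1)


# Hypothetical toy training data
train = pd.DataFrame({
    "col1": ["a", "b", "a", "b"],
    "col2": ["x", "x", "y", "y"],
    "class": ["cat", "dog", "cat", "dog"],
})
X = pd.get_dummies(train[["col1", "col2"]], dtype=int)
y = pd.get_dummies(train["class"], dtype=int)  # columns "cat" and "dog"

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

pred = clf.predict(X)                     # 2-D array of 0/1 values
labels = decode_one_hot(pred, y.columns)  # back to "cat" / "dog"
print(labels.tolist())
```

One caveat: with a binarized label the forest predicts each column independently, so a row can come back as all zeros (or with several ones), and idxmax then simply returns the first matching column. You need the label column names (y.columns here) at prediction time, so save them next to the model if you predict in a separate program.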