I have built a DBSCAN clustering model, but the cluster labels from the original run and the result after loading the model from the pkl files do not match.
Below, the cluster for the 1st record is 0, but after running it from the 'pkl' file, the predicted result is [-1].
Dataframe:
HD   MC     WT   Cluster
200  Other  4.5  0
150  Pep    5.6  0
100  Pla    35   -1
50   Same   15   0
Code
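import pickle
from collections import Counter
import pandas as pd
from sklearn import preprocessing
from sklearn.cluster import DBSCAN
######## Label encoder for column MC ##############
le = preprocessing.LabelEncoder()
df['MC_encoded'] = le.fit_transform(df['MC'])
col_1 = ['HD','MC_encoded','WT']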
data = df[col_1]
data = data.fillna(value=0)
######### DBSCAN Clustering ##################
model = DBSCAN(eps=7, min_samples=2).fit(data)
outliers_df = pd.DataFrame(data)
print(Counter(model.labels_))
######## Predict ###############
x = model.fit_predict(data)
df["Cluster"] = x
####### Create model pkl files and dump ################
filename1 = '/model.pkl'
model_df = open(filename1, 'wb')
pickle.dump(model,model_df)
model_df.close()
######## Create Encoder pkl files and dump ############
output = open('/MC.pkl', 'wb')
pickle.dump(le, output)
output.close()
####### Load the model pkl file ##############
with open('model.pkl', 'rb') as file:
    pickle_model = pickle.load(file)
########## Load Encoder pkl file ############
pkl_file = open('MC.pkl', 'rb')
le_mc = pickle.load(pkl_file)
pkl_file.close()
######## Function to predict new data ##############
def testing(HD,MC,WT):
    test = {'HD':[HD],'MC':[MC], 'WT':[WT]}
    test = pd.DataFrame(test)
    test['MC_encoded'] = le_mc.transform(test['MC'])
    pred_val = pickle_model.fit_predict(test[['HD','MC_encoded','WT']])
    print(pred_val)
    return pred_val
###### Predict with new observation ###########
pred_val = testing(200,'Other',4.5)
Resulting cluster
[-1]
Expected cluster
[0]
Clustering is not predictive.
If you want to classify new instances, use a classifier.
So in my opinion you are using it entirely on the wrong premises...
Nevertheless, your mistake is that you use the wrong function. fit_predict literally means: discard the old model, then fit, and return the labels. This is because of a pretty poor design of sklearn that conflates learning algorithms with the resulting models. A model should not have a fit method anymore, and a training algorithm should not have a predict method, as there is no model yet...
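To see the refit happen, here is a minimal sketch on made-up toy data (the arrays and the eps value are my own assumptions, not the asker's data). It fits once, then calls fit_predict on a different array and checks that the stored labels now describe the second array only:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): two tight pairs of points.
train = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 10.0], [10.5, 10.0]])

model = DBSCAN(eps=1.0, min_samples=2).fit(train)
labels_after_fit = model.labels_.copy()  # one label per training row

# Calling fit_predict on other data refits from scratch;
# nothing learned from `train` is kept.
other = np.array([[100.0, 100.0], [100.2, 100.0], [200.0, 200.0]])
returned = model.fit_predict(other)

print(labels_after_fit)  # labels for the 4 training rows
print(returned)          # labels for the 3 rows of `other`
print(model.labels_)     # same as `returned`: the old labels are gone
```

After the second call, model.labels_ has three entries, one per row of the second array, so the pickled "model" carries no memory of the original clustering once fit_predict is called again.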
Now if you fit to a dataset of fewer than min_samples points, they must all be noise (-1) by definition. Here the testing function refits on a single row while min_samples=2, so that row is labeled -1. You meant to use predict only, which does not exist, because DBSCAN does not predict for new data points.
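If you still want a label for a new row, one common workaround (not an sklearn API, just a sketch) is to give the row the cluster of its nearest core sample when that core sample lies within eps, and -1 otherwise. The helper name assign_to_cluster and the toy numbers below are hypothetical; the toy rows only loosely mimic the HD / encoded MC / WT columns from the question. The sketch relies on the fitted attributes components_, core_sample_indices_, labels_ and eps of DBSCAN:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def assign_to_cluster(model, x_new):
    """Hypothetical helper: label x_new via the nearest core sample of a fitted DBSCAN."""
    x_new = np.asarray(x_new, dtype=float)
    core_points = model.components_  # copy of the core samples from training
    if len(core_points) == 0:
        return -1  # no clusters were found at all
    core_labels = model.labels_[model.core_sample_indices_]
    dists = np.linalg.norm(core_points - x_new, axis=1)  # Euclidean, DBSCAN's default
    nearest = int(np.argmin(dists))
    # Within eps of a core sample: join its cluster; otherwise treat as noise.
    return int(core_labels[nearest]) if dists[nearest] <= model.eps else -1

# Toy data (made up): columns play the role of HD, encoded MC, WT.
X = np.array([
    [200.0, 0.0, 4.5],
    [201.0, 0.0, 5.0],
    [150.0, 1.0, 5.6],
    [151.0, 1.0, 6.0],
    [100.0, 2.0, 35.0],
])
model = DBSCAN(eps=7, min_samples=2).fit(X)

near = assign_to_cluster(model, [200.0, 0.0, 4.5])  # sits on a core sample
far = assign_to_cluster(model, [0.0, 0.0, 0.0])     # far from every core sample

# For contrast: refitting on the single new row, as fit_predict does.
refit = DBSCAN(eps=7, min_samples=2).fit_predict(np.array([[200.0, 0.0, 4.5]]))
print(near, far, refit)
```

Note that this only approximates what DBSCAN itself would do: a new point never becomes a core sample here and never merges clusters. If you need real predictions, training a classifier on the cluster labels is the cleaner route.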