Tags: python-3.x, pandas, pickle, unsupervised-learning, dbscan

DBSCAN model loaded from a pickle file gives the wrong cluster for a new observation


I have built a DBSCAN clustering model, but the cluster labels from the original fit do not match the results I get after loading the model from the pkl file.

Below, the cluster for the first record is 0, but after running the same record through the model loaded from the 'pkl' file, the predicted result is [-1].

Dataframe:

        HD         MC             WT         Cluster
        200        Other          4.5        0
        150        Pep            5.6        0
        100        Pla            35         -1
        50         Same           15         0

Code

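       ######## Imports ##############
       import pickle
       from collections import Counter
       import pandas as pd
       from sklearn import preprocessing
       from sklearn.cluster import DBSCAN

       ######## Label encoder for column MC ##############
       le = preprocessing.LabelEncoder()
       df['MC_encoded'] = le.fit_transform(df['MC'])

       col_1 = ['HD','MC_encoded','WT']
       data = df[col_1]
       data = data.fillna(value=0)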

       ######### DBSCAN Clustering ##################
       model = DBSCAN(eps=7, min_samples=2).fit(data)
       outliers_df = pd.DataFrame(data)
       print(Counter(model.labels_))

       ######## Predict ###############
       x = model.fit_predict(data)
       df["Cluster"] = x

       ####### Create model pkl files and dump ################
       filename1 = '/model.pkl'
       model_df = open(filename1, 'wb')
       pickle.dump(model,model_df)
       model_df.close()

       ######## Create Encoder pkl files and dump ############
       output = open('/MC.pkl', 'wb')
       pickle.dump(le, output)
       output.close()

       ####### Load the model pkl file ##############
       with open('model.pkl', 'rb') as file:
           pickle_model = pickle.load(file)


       ########## Load Encoder pkl file ############
       pkl_file = open('MC.pkl', 'rb')
       le_mc = pickle.load(pkl_file) 
       pkl_file.close()


       ######## Function to predict new data ##############
       def testing(HD, MC, WT):
           test = {'HD': [HD], 'MC': [MC], 'WT': [WT]}
           test = pd.DataFrame(test)
           test['MC_encoded'] = le_mc.transform(test['MC'])
           pred_val = pickle_model.fit_predict(test[['HD', 'MC_encoded', 'WT']])
           print(pred_val)
           return pred_val


       ###### Predict with new observation ###########
       pred_val = testing(200,'Other',4.5)

Resulting cluster

         [-1]

Expected cluster

         [0]

Solution

  • Clustering is not predictive.

    If you want to classify new instances, use a classifier.

    So in my opinion, you are approaching this from the wrong premise entirely.

    That said, the immediate mistake is that you are calling the wrong function.

    fit_predict literally means: discard the old model, fit again, and return the labels. This follows from a rather poor design choice in sklearn, which conflates learning algorithms with the models they produce. A fitted model should no longer have a fit method, and a training algorithm should not have a predict method, because there is no model yet.

    Now, if you fit on a dataset with fewer than min_samples points, all of them must be noise (-1) by definition. Here the refit sees a single row while min_samples=2, hence [-1]. You meant to call predict only, which does not exist, because DBSCAN does not predict labels for new data points.
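To make the point above concrete, here is a minimal sketch. It first shows that fit_predict on a single row returns noise, then shows one possible workaround: give a new point the label of its nearest core sample if that sample lies within eps. The data values and the helper name assign_to_cluster are made up for illustration. The helper is not part of scikit-learn, it assumes the default Euclidean metric, and it only approximates what DBSCAN would do if the point had been in the training data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up numeric rows standing in for the encoded HD / MC / WT columns.
X = np.array([
    [200.0, 0.0, 4.5],
    [201.0, 0.0, 5.0],
    [199.0, 1.0, 4.0],
    [100.0, 2.0, 35.0],   # far from everything else, so it ends up as noise
])

model = DBSCAN(eps=7, min_samples=2).fit(X)
print(model.labels_)

# fit_predict on one row refits from scratch: a lone point can never reach
# min_samples=2, so it is labelled noise even though it is identical to X[0].
refit_labels = DBSCAN(eps=7, min_samples=2).fit_predict(X[:1])
print(refit_labels)


def assign_to_cluster(fitted, X_new):
    """Hypothetical helper (not a scikit-learn API).

    Labels each new row with the cluster of its nearest core sample,
    or -1 when no core sample lies within eps. Assumes Euclidean distance.
    """
    X_new = np.asarray(X_new, dtype=float)
    core_points = fitted.components_                       # coordinates of core samples
    core_labels = fitted.labels_[fitted.core_sample_indices_]
    labels = np.full(len(X_new), -1, dtype=int)
    if len(core_points) == 0:
        return labels                                      # no clusters were found at all
    # Pairwise distances, shape (n_new, n_core).
    dists = np.linalg.norm(X_new[:, None, :] - core_points[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    within = dists[np.arange(len(X_new)), nearest] <= fitted.eps
    labels[within] = core_labels[nearest[within]]
    return labels


# The first row matches a training point; the second is far from every cluster.
assigned = assign_to_cluster(model, [[200.0, 0.0, 4.5], [150.0, 0.0, 20.0]])
print(assigned)
```

Because this never calls fit or fit_predict on the loaded model, the pickled clustering stays untouched. If what you really need is to label new observations routinely, training a classifier on the cluster labels is the more conventional route.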