Currently, I have a dataset with two columns: procedure name and its CPT code. For example, Total Knee Arthroplasty-27447, Total Hip Arthroplasty-27130, Open Carpal Tunnel Release-64721. The dataset has 3000 rows and a total of 5 distinct CPT codes (5 classes). I am writing a classification model. When I pass a nonsense input, for example "open knee arthroplasty carpal tunnel release", it outputs 64721, which is wrong. Below is the code I am using. What changes could I make, and is choosing a neural network the right approach for this problem?
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.neural_network import MLPClassifier
xl = pd.ExcelFile("dataset.xlsx") # reading the data
df = xl.parse('Query 2.2')
# shuffling the data
df=df.sample(frac=1)
X_train, X_test, y_train, y_test = train_test_split(df['procedure'], df['code'], random_state = 0,test_size=0.10)
count_vect = CountVectorizer().fit(X_train)
X_train_counts = count_vect.transform(X_train)
tfidf_transformer = TfidfTransformer().fit(X_train_counts)
X_train_tfidf = tfidf_transformer.transform(X_train_counts)
model = MLPClassifier(hidden_layer_sizes=(25,), max_iter=500)
classificationModel=model.fit(X_train_tfidf, y_train)
data_to_be_predicted="open knee arthroplasty carpal tunnel release"
# apply the same TF-IDF transform used during training, not just raw counts
features = tfidf_transformer.transform(count_vect.transform([data_to_be_predicted]))
result = classificationModel.predict(features)
predictionProbablityMatrix = classificationModel.predict_proba(features)
maximumPredictedValue = np.amax(predictionProbablityMatrix)
if maximumPredictedValue * 100 > 99:
    print(result[0])
else:
    print("00000")
I'd recommend using Keras for this problem. All the preprocessing you did with sklearn after splitting the training and test data could be done with numpy and fed into Keras, which would be more readable and make it less confusing to follow what's going on. If the rows are all strings, you could split the data by rows with plain Python, like
row = data[i].split(',')
which would give you the columns of the row split apart. If you have 5 known classes, I'd take all the class names and replace them with numbers in the dataset. I've never used sklearn to implement a neural network, but note that hidden_layer_sizes=(25) means one hidden layer with 25 neurons, not 25 hidden layers; a network that small should be plenty for this problem.
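For the name-to-number replacement, sklearn's LabelEncoder would do it in two lines (made-up labels below, just to show the idea):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up CPT labels just to show the idea
codes = ["27447", "27130", "64721", "27447", "27130"]

encoder = LabelEncoder()
y = encoder.fit_transform(codes)       # -> [1 0 2 1 0], one integer id per class
print(y)
print(encoder.inverse_transform(y))    # back to the original CPT strings
```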
Sorry I couldn't help you more precisely with your problem, but I think you can solve it more easily if you redo it like I said... good luck, buddy!
edit: Maybe the problem isn't in the parsed dataset but in the NN implementation; that's why I think Keras is clearer.