I'm new to ML. I've watched a few tutorials, put together my own dataset, and after a few hiccups everything is working.
Now I need to build a web app around this model. I found that I can use the pickle library to save the model. The problem is that I don't know whether I've prepared the model well. I want to take the information from four columns in my database, which are in X, and get three outputs: Alloy, Hours in oven, and Temper. The idea is to create a Flask API from this model, plus a simple HTML form with some CSS styling where the user can enter four mechanical requirements and get back how to achieve them: which alloy to use, which temper, and how many hours in the ageing oven.
https://github.com/nemanjaKostovski/MLmodel - This is what I have so far. For the full mini environment, you can check my GitHub repo. Thanks in advance.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import scikitplot as skplt
import seaborn as sns
df = pd.read_csv('Bazaproizvodnjaprofila1.csv')
X = df[['Rm', 'Rp', 'A%', 'Wb', 'Hours in oven' ]].values
y = df[['Alloy']].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)
print("whole dataset:", X.shape, y.shape)
print("training set:", X_train.shape, y_train.shape)
print("test set:", X_test.shape, y_test.shape)
model = LogisticRegression(max_iter=6500)
model.fit(X_train, y_train.ravel())
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
sns.relplot(data=df, x="Rm", y="A%", hue="Alloy", alpha=0.8)
skplt.metrics.plot_confusion_matrix(
y_test,
y_pred,
figsize=(12,12),
text_fontsize=20,
title_fontsize=20)
import pickle
with open('alloys_model.pkl', 'wb') as file:
pickle.dump(model, file)
Seems alright to me. I've made some minor tweaks based on common conventions and my personal preference. Keep going! One general suggestion: create a holdout validation set or use cross-validation to measure accuracy (or whichever metric you use to validate your model). Don't touch the test set until you are confident in your model.
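To make the cross-validation suggestion concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` as a stand-in for the CSV (four numeric features, three classes), so the numbers it prints say nothing about the real alloy data. The names `X_demo` and `y_demo` and the choice of 5 folds are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the real table (assumption): 4 features, 3 classes.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=2,
)

# Set the test set aside first; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=2
)

model = LogisticRegression(max_iter=6500)

# 5-fold cross-validation on the training portion only.
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")

# Only once the model choice is settled: fit on all training data, score once on test.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
```

Comparing models or settings on `cv_scores` rather than on the test score keeps the final test number an honest estimate.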
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import scikitplot as skplt
import seaborn as sns
df = pd.read_csv('Bazaproizvodnjaprofila1.csv')
# More concise way using drop; axis='columns' works too.
# This assumes every column except the target is used as a feature.
X = df.drop('Alloy', axis=1).values
# No need for double []; a single [] gives a 1D array after .values
y = df['Alloy'].values
# Use stratify if you want the training and test sets to keep the same class distribution.
# The default test size is 0.25; you can change it with test_size.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y, random_state=2)
# Just my personal preference: f-strings are neater
print(f"whole dataset shape: X: {X.shape}, y: {y.shape}")
print(f"Training set shape: X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"Test set shape: X_test: {X_test.shape}, y_test: {y_test.shape}")
model = LogisticRegression(max_iter=6500)
# No .ravel() needed, since y is already 1D without the double []
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
# The default score metric for classification models is accuracy:
# .score() on LogisticRegression calls sklearn.metrics.accuracy_score internally.
print(f'Test Accuracy score: {model.score(X_test, y_test)}')
sns.relplot(data=df, x="Rm", y="A%", hue="Alloy", alpha=0.8)
skplt.metrics.plot_confusion_matrix(
y_test,
y_pred,
figsize=(12,12),
text_fontsize=20,
title_fontsize=20)
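For the web-app step from the question, the pickled model has to be loaded back and called with one row of user input. Below is a rough sketch of that round trip. It assumes a toy model trained on made-up numbers (the feature values and the labels `alloy_A` / `alloy_B` are placeholders, not real data) and uses an in-memory buffer in place of `alloys_model.pkl`.

```python
import io
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training rows (placeholders): columns stand in for Rm, Rp, A%, Wb.
X_toy = np.array([
    [200.0, 160.0, 10.0, 60.0],
    [210.0, 170.0, 9.0, 65.0],
    [260.0, 240.0, 8.0, 80.0],
    [270.0, 250.0, 7.0, 85.0],
])
y_toy = np.array(["alloy_A", "alloy_A", "alloy_B", "alloy_B"])  # hypothetical labels

model = LogisticRegression(max_iter=6500).fit(X_toy, y_toy)

# Serialize and restore, as the web app would do with the .pkl file.
buffer = io.BytesIO()
pickle.dump(model, buffer)
buffer.seek(0)
loaded_model = pickle.load(buffer)

# One request = one row; predict expects a 2D array of shape (1, n_features).
user_input = np.array([[265.0, 245.0, 7.5, 82.0]])
prediction = loaded_model.predict(user_input)
print(prediction[0])
```

The input row must list the features in the same order, and with the same number of columns, as the data the model was trained on. Pickle files should only be loaded from sources you trust.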