I want to build a classifier, but I'm having trouble finding sources that clearly explain the Keras functions and how to go about what I'm trying to do. I want to use the following data:
    0       1    2    3        4          5       6    7
0   Name    TRY  LOC  OUTPUT   TYPE_A     SIGNAL  A-B  SPOT
1   inc 1   2    20   TYPE-1   TORPEDO    ULTRA   A    -21
2   inc 2   3    16   TYPE-2   TORPEDO    ILH     B    -14
3   inc 3   2    20   BLACK47  TORPEDO    LION    A    49
4   inc 4   3    12   TYPE-2   CENTRALPA  LION    A    25
5   inc 5   3    10   TYPE-2   THREE      LION    A    -21
6   inc 6   2    20   TYPE-2   ATF        LION    A    -48
7   inc 7   4    2    NIVEA-1  ATF        LION    B    -23
8   inc 8   3    16   NIVEA-1  ATF        LION    B    18
9   inc 9   3    18   BLENDER  CENTRALPA  LION    B    48
10  inc 10  4    20   DELCO    ATF        LION    B    -26
11  inc 11  3    20   VE248    ATF        LION    B    44
12  inc 12  1    20   SILVER   CENTRALPA  LION    B    -35
13  inc 13  2    20   CALVIN3  SEVENX     LION    B    -20
14  inc 14  3    14   DECK-BT  CENTRALPA  LION    B    -38
15  inc 15  4    4    10-LEVI  BERWYEN    OWL     B    -29
16  inc 16  4    14   TYPE-2   ATF        NOV     B    -31
17  inc 17  4    10   NYNY     TORPEDO    NOV     B    21
18  inc 18  2    20   NIVEA-1  CENTRALPA  NOV     B    45
19  inc 19  3    27   FMRA97   TORPEDO    NOV     B    -26
20  inc 20  4    18   SILVER   ATF        NOV     B    -46
I want to use columns 1, 2, 4, 5, 6 and 7 as input, and the output would be column 3 (OUTPUT).
The code I currently have is:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot
df = pd.read_csv("file.csv")
df.drop('Name', axis=1, inplace=True)
obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)
target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())
X = df4[predictors].values
y = df4[target_column].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)
model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))
# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)
I can't figure out why this is the result I'm getting:
Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04
I also can't seem to find any answers related to this problem. Am I using the Keras functions incorrectly? Is it the way I'm converting the object-type columns to integers? Assuming there are 1250 output classes, how would I fix the layers? Any tips or help would be appreciated. Thank you.
As I said in the comments, this looks like a clear case of model underfitting: you have too little data for the size of the model itself. Rather than playing around with the layer sizes, try an SVM or RandomForest classifier first and see whether any reasonable classification is possible with your data at all. With this amount of data, a neural network is hardly ever a good choice.
So do this instead:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd
df = pd.read_csv("file.csv")  # your data, loaded as in the question
# Inputs are columns 1, 2, 4, 5, 6, 7; column 0 (Name) is a row ID and is left out
X = df.iloc[:, [1, 2, 4, 5, 6, 7]]
y = df.iloc[:, 3]  # target: OUTPUT
# One-hot encode the string columns (TYPE_A, SIGNAL, A-B); numeric columns are left as they are
X = pd.get_dummies(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
rf.fit(X_train, y_train)
predictions = rf.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)  # fraction of test rows classified correctly
If this works and produces reasonable predictions, you can go ahead and try to tune your sequential model.
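When you do come back to the Keras model, the output layer in the question is worth a look. Softmax over a single unit always returns 1.0, so the cross-entropy term log(1.0) is zero whatever the label is, which would explain a reported loss of 0.0000e+00 alongside near-zero accuracy. Below is a minimal NumPy sketch of that effect. It is a simplified re-implementation, not Keras's actual internals, and `n_classes = 4`, the logits and the labels are made-up values for illustration.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Case 1: a Dense(1) output layer produces one logit per sample.
single_unit_logits = np.array([[-3.2], [0.0], [7.5]])
probs = softmax(single_unit_logits)  # every row normalises to [1.0]

# Categorical cross-entropy per sample: -sum(y_true * log(y_pred)).
# The integer codes are fed in directly (not one-hot), as in the question.
y_true = np.array([[0.0], [5.0], [12.0]])
loss = -(y_true * np.log(probs)).sum(axis=1)
print(probs.ravel(), loss)  # all probabilities are 1, all losses are 0

# Case 2: one output unit per class (n_classes is a made-up value here).
n_classes = 4
rng = np.random.default_rng(0)
multi_logits = rng.normal(size=(3, n_classes))
probs_multi = softmax(multi_logits)

# Sparse categorical cross-entropy: -log of the probability assigned
# to the true integer label of each sample.
labels = np.array([0, 2, 3])
sparse_loss = -np.log(probs_multi[np.arange(len(labels)), labels])
print(sparse_loss)  # strictly positive, so there is a signal to learn from
```

If that is what is happening, the usual remedy is a final `Dense(n_classes, activation='softmax')` layer combined with `loss='sparse_categorical_crossentropy'`, which accepts integer labels such as the `cat.codes` values directly.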
EDIT: I just read your comment that you have 1250 class labels and 5000 samples in total. This is unlikely to work with most classifiers: there are too many classes and too little sample support, about 4 samples per class on average.
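To put a number on sample support, it can help to count how many rows each label has and what accuracy you would get by always guessing the most frequent label. Here is a minimal sketch using only the standard library. `class_support` is a hypothetical helper name, and the labels are the OUTPUT values of the 20 sample rows shown in the question, not your full data set.

```python
from collections import Counter

def class_support(labels):
    """Summarise how many samples back each class label."""
    counts = Counter(labels)
    n_samples = len(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return {
        "n_samples": n_samples,
        "n_classes": len(counts),
        "mean_per_class": n_samples / len(counts),
        "min_per_class": min(counts.values()),
        "majority_label": majority_label,
        # Accuracy of always predicting the most frequent label.
        "majority_baseline": majority_count / n_samples,
    }

# OUTPUT column of the 20 sample rows from the question.
sample_output = [
    "TYPE-1", "TYPE-2", "BLACK47", "TYPE-2", "TYPE-2",
    "TYPE-2", "NIVEA-1", "NIVEA-1", "BLENDER", "DELCO",
    "VE248", "SILVER", "CALVIN3", "DECK-BT", "10-LEVI",
    "TYPE-2", "NYNY", "NIVEA-1", "FMRA97", "SILVER",
]

stats = class_support(sample_output)
print(stats)

# The totals quoted in the comment: 5000 samples over 1250 labels.
full_mean_per_class = 5000 / 1250
print(full_mean_per_class)
```

Any classifier worth keeping should beat `majority_baseline` on held-out rows. When `min_per_class` is 1, some labels cannot appear in both the training and the test split at all.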