python pandas tensorflow keras keras-layer

Creating sequential model using keras/python and CSV file but getting bad accuracy


I want to build a classifier, but I'm having trouble finding sources that clearly explain the Keras functions and how to go about what I'm trying to do. I want to use the following data:

         0    1    2        3          4       5    6     7
0     Name  TRY  LOC   OUTPUT     TYPE_A   SIGNAL  A-B  SPOT
1    inc 1    2   20   TYPE-1    TORPEDO   ULTRA    A   -21
2    inc 2    3   16   TYPE-2    TORPEDO     ILH    B   -14
3    inc 3    2   20  BLACK47    TORPEDO    LION    A    49
4    inc 4    3   12   TYPE-2  CENTRALPA    LION    A    25
5    inc 5    3   10   TYPE-2      THREE    LION    A   -21
6    inc 6    2   20   TYPE-2        ATF    LION    A   -48
7    inc 7    4    2  NIVEA-1        ATF    LION    B   -23
8    inc 8    3   16  NIVEA-1        ATF    LION    B    18
9    inc 9    3   18  BLENDER  CENTRALPA    LION    B    48
10   inc 10   4   20    DELCO        ATF    LION    B   -26
11   inc 11   3   20    VE248        ATF    LION    B    44
12   inc 12   1   20   SILVER  CENTRALPA    LION    B   -35
13   inc 13   2   20  CALVIN3     SEVENX    LION    B   -20
14   inc 14   3   14  DECK-BT  CENTRALPA    LION    B   -38
15   inc 15   4    4  10-LEVI    BERWYEN     OWL    B   -29
16   inc 16   4   14   TYPE-2        ATF     NOV    B   -31
17   inc 17   4   10     NYNY    TORPEDO     NOV    B    21
18   inc 18   2   20  NIVEA-1  CENTRALPA     NOV    B    45
19   inc 19   3   27   FMRA97    TORPEDO     NOV    B   -26
20   inc 20   4   18   SILVER        ATF     NOV    B   -46

I want to use columns 1, 2, 4, 5, 6, 7 as input, and the output would be column 3 (OUTPUT).

The code I currently have is:

import os
import pandas as pd
from sklearn.model_selection import train_test_split
import tensorflow as tf
import numpy as np
from sklearn import metrics
from keras.models import Sequential
from keras.layers.core import Dense, Activation
from keras.callbacks import EarlyStopping
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.text import one_hot

df = pd.read_csv("file.csv")

df.drop('Name', axis=1, inplace=True)

obj_df = df.select_dtypes(include=['object']).copy()
# print(obj_df.head())
obj_df["OUTPUT"] = obj_df["OUTPUT"].astype('category')
obj_df["TYPE_A"] = obj_df["TYPE_A"].astype('category')
obj_df["SIGNAL"] = obj_df["SIGNAL"].astype('category')
obj_df["A-B"] = obj_df["A-B"].astype('category')
# obj_df.dtypes
obj_df["OUTPUT_cat"] = obj_df["OUTPUT"].cat.codes
obj_df["TYPE_A_cat"] = obj_df["TYPE_A"].cat.codes
obj_df["SIGNAL_cat"] = obj_df["SIGNAL"].cat.codes
obj_df["A-B_cat"] = obj_df["A-B"].cat.codes
# print(obj_df.head())
df2 = df[['TRY', 'LOC', 'SPOT']]
df3 = obj_df[['OUTPUT_cat', 'TYPE_A_cat', 'SIGNAL_cat', 'A-B_cat']]
df4 = pd.concat([df2, df3], axis=1, sort=False)

target_column = ['OUTPUT_cat']
predictors = list(set(list(df4.columns))-set(target_column))
df4[predictors] = df4[predictors]/df4[predictors].max()
print(df4.describe())

X = df4[predictors].values
y = df4[target_column].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=40)
print(X_train.shape); print(X_test.shape)

model = Sequential()
model.add(Dense(5000, activation='relu', input_dim=6))
model.add(Dense(1000, activation='relu'))
model.add(Dense(500, activation='relu'))
model.add(Dense(1, activation='softmax'))

# Compile the model
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# build the model
model.fit(X_train, y_train, epochs=20, batch_size=150)

I can't figure out why this is the result I'm getting:

Epoch 20/20
56/56 [==============================] - 4s 77ms/step - loss: 0.0000e+00 - accuracy: 1.8165e-04

I also can't seem to find any answers related to this problem. Am I using the Keras functions incorrectly? Is it the way I'm converting object types to integers? Assuming there are 1250 outputs, how would I fix the layers? Any tips or help would be appreciated. Thank you.


Solution

  • As I said in the comments, this looks like a clear case of model underfitting: you have too little data for the size of the model. Rather than playing around with the layer sizes, try an SVM or RandomForest classifier first and see whether any reasonable classification is possible with your data at all. With this amount of data, a neural network is hardly ever a good choice anyway.

    So do this instead:

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split
    
    df = pd.read_csv("file.csv")  # your data, loaded as in the question
    
    # Inputs are columns 1, 2, 4, 5, 6, 7; column 0 (Name) is only a row label
    X = df.iloc[:, [i for i in range(1, 8) if i != 3]]
    y = df.iloc[:, 3]  # column 3 (OUTPUT) is the target
    
    # One-hot encode the categorical input columns
    X = pd.get_dummies(X)
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
    
    rf = RandomForestClassifier(n_estimators=50, min_samples_leaf=5, n_jobs=-1)
    rf.fit(X_train, y_train)
    predictions = rf.predict(X_test)
    
    # Fraction of test rows whose OUTPUT was predicted correctly
    accuracy = accuracy_score(y_test, predictions)
    print(accuracy)
    

    If this works and can make some predictions then you can go ahead and try to tune your sequential model.
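    When you do return to the sequential model, the output layer in the question is worth a second look. A softmax over a single unit always outputs 1.0, so a Dense(1, activation='softmax') layer cannot learn anything, and the reported loss of exactly 0 is what you would expect from that. The NumPy sketch below only illustrates this arithmetic; the logits, targets and n_classes are made-up values, not taken from your data.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs of a single-unit layer for four samples.
single_unit_logits = np.array([[-3.2], [0.0], [1.7], [42.0]])
single_unit_probs = softmax(single_unit_logits)  # every entry is 1.0

# Integer class codes used directly as targets (values are made up).
y_true = np.array([[0.0], [1.0], [7.0], [3.0]])

# Categorical cross-entropy is -sum(y_true * log(p)); log(1.0) is 0,
# so the loss is 0 no matter what the targets are.
loss = -(y_true * np.log(single_unit_probs)).sum(axis=-1).mean()

# With one unit per class, softmax produces a real probability distribution.
n_classes = 8  # hypothetical; the question mentions about 1250 classes
rng = np.random.default_rng(0)
multi_unit_probs = softmax(rng.normal(size=(4, n_classes)))
```

    One possible fix, assuming the targets stay as integer codes, is a final Dense layer with one unit per class and loss='sparse_categorical_crossentropy'. That makes the model trainable, but it does not solve the lack of data.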

    EDIT: Just read your comment that you have 1250 class labels and 5000 samples in total. That averages only about four samples per class, which is unlikely to work with most classifiers: too many classes and too little sample support.
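    To see how thin the support per class really is, you can count the samples for each label before training anything. This is a minimal sketch: the labels are the 20 OUTPUT values shown in the question, and min_support is a hypothetical threshold you would pick yourself.

```python
import pandas as pd

# The 20 OUTPUT values from the sample in the question;
# with the real data this would be df["OUTPUT"].
labels = pd.Series([
    "TYPE-1", "TYPE-2", "BLACK47", "TYPE-2", "TYPE-2", "TYPE-2", "NIVEA-1",
    "NIVEA-1", "BLENDER", "DELCO", "VE248", "SILVER", "CALVIN3", "DECK-BT",
    "10-LEVI", "TYPE-2", "NYNY", "NIVEA-1", "FMRA97", "SILVER",
])

# Number of samples available for each class label.
counts = labels.value_counts()

# Classes with fewer samples than the (hypothetical) threshold.
min_support = 3
rare = counts[counts < min_support]

# Keep only rows whose class has enough support.
kept = labels[labels.map(counts) >= min_support]

# Average support implied by the numbers in the comment.
avg_per_class = 5000 / 1250

print(counts.head())
print(f"{len(rare)} of {len(counts)} classes have fewer than {min_support} samples")
```

    If most classes turn out to be rare, merging them into broader groups or collecting more data is likely to help more than any change to the model.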