I am trying to use a fully connected neural network (multilayer perceptron) to perform multi-class classification. My training data (X) are different DNA strings of equal length. Each of these sequences has a floating-point value associated with it (e.g. t_X), which I use to simulate labels (y) for my data in the following way: y ~ np.random.poisson(constant * t_X).
After training my Keras model (please see below), I made a histogram of the predicted labels and the test labels. The issue I am facing is that my model seems to classify a lot of sequences incorrectly (please see the image linked below).
My training data looks like the following:
X , Y
CTATTACCTGCCCACGGTAAAGGCGTTCTGG, 1
TTTCTGCCCGCGGCCTGGCAATTGATACCGC, 6
TTTTTACACGCCTTGCGTAAAGCGGCACGGC, 4
TTGCTGCCTGGCCGATGGTCTATGCCGCTGC, 7
I one-hot encode my Y's, and my X sequences are turned into tensors of shape (batch size, sequence length, number of characters); these numbers are something like 10,000 by 50 by 4.
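For reference, here is a minimal sketch of the encoding described above. It assumes the alphabet is exactly A, C, G, T and that the labels are non-negative integer counts; the helper names `one_hot_sequences` and `one_hot_labels` are hypothetical and only for illustration.

```python
import numpy as np

# Assumed alphabet order; any fixed order works if it is used consistently.
ALPHABET = "ACGT"
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_sequences(seqs):
    """Encode equal-length DNA strings as a (n_sequences, length, 4) float array."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    encoded = np.zeros((len(seqs), length, len(ALPHABET)), dtype=np.float32)
    for i, seq in enumerate(seqs):
        for j, ch in enumerate(seq):
            encoded[i, j, CHAR_TO_INDEX[ch]] = 1.0
    return encoded


def one_hot_labels(counts, num_classes=None):
    """Encode non-negative integer counts as one-hot rows."""
    counts = np.asarray(counts, dtype=int)
    if num_classes is None:
        # Hypothetical default: one class per count value from 0 to the maximum.
        num_classes = counts.max() + 1
    return np.eye(num_classes, dtype=np.float32)[counts]


# The first two rows of the sample data shown above.
seqs = ["CTATTACCTGCCCACGGTAAAGGCGTTCTGG", "TTTCTGCCCGCGGCCTGGCAATTGATACCGC"]
X = one_hot_sequences(seqs)
Y = one_hot_labels([1, 6])
print(X.shape, Y.shape)
```

The last axis of `Y` is the number of classes, which is the width the softmax output layer needs.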
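My Keras model looks like:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten

model = Sequential()
# input_shape belongs on the first layer: (sequence length, number of characters)
model.add(Flatten(input_shape=(50, 4)))
model.add(Dense(100, activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.25))
# the output width must equal the number of classes
model.add(Dense(len(one_hot_encoded_labels), activation='softmax'))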
I have tried the following loss functions:
#model.compile(loss='mean_squared_error',optimizer=Adam(lr=0.00001), metrics=['accuracy'])
#model.compile(loss='mean_squared_error',optimizer=Adam(lr=0.0001), metrics=['mean_absolute_error',r_square])
#model.compile(loss='kullback_leibler_divergence',optimizer=Adam(lr=0.00001), metrics=['categorical_accuracy'])
#model.compile(loss=log_poisson_loss,optimizer=Adam(lr=0.0001), metrics=['categorical_accuracy'])
#model.compile(loss='categorical_crossentropy',optimizer=Adam(lr=0.0001), metrics=['categorical_accuracy'])
model.compile(loss='poisson',optimizer=Adam(lr=0.0001), metrics=['categorical_accuracy'])
The loss behaves reasonably; it goes down and flattens out with increasing epochs. I have tried different learning rates, different optimizers, different number of neurons in each layer, different number of hidden layers and different types of regularization.
I think that my model always puts most predicted labels around the peak of the test data (please see the linked histogram), but it is unable to classify the sequences with fewer counts in the test set. Is this a common problem?
Without going to other architectures (like convolutional or recurrent networks), does anyone know how I might be able to improve classification performance for this model?
From your histograms, it is clear that your test data set is very imbalanced. I am assuming that your training data has the same distribution. That may be why the network performs poorly: it does not have enough examples of many of the classes to learn their features. You can try resampling techniques (over-sampling the rare classes or under-sampling the common ones) so that the network sees the classes in more comparable proportions.
Here is a link that explains various methods for handling such imbalanced data sets.
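As an alternative to resampling (a different technique with the same goal), the rare classes can be weighted more heavily in the loss. A minimal sketch, assuming integer class labels; the helper `balanced_class_weights` and the toy labels are hypothetical, not taken from your data:

```python
import numpy as np


def balanced_class_weights(labels):
    """Return {class_index: weight}, inversely proportional to class frequency."""
    labels = np.asarray(labels, dtype=int)
    classes, counts = np.unique(labels, return_counts=True)
    # n_samples / (n_classes * count): every weight is 1.0 when classes are balanced.
    weights = labels.size / (classes.size * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy integer labels (not real data): class 3 dominates.
y_train = np.array([3, 3, 3, 3, 3, 3, 2, 2, 4, 7])
class_weight = balanced_class_weights(y_train)
print(class_weight)
# With Keras, such a dictionary can be passed as model.fit(..., class_weight=class_weight).
# Classes that never occur in y_train get no entry here and may need an explicit weight.
```

With these weights, every class contributes the same total weight to the loss, regardless of how many samples it has.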
Second, you can check the model's performance with cross-validation, which helps you see whether the error is reducible or irreducible. If the error is irreducible, you cannot improve any further with this model (you would have to try another method in that situation).
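A minimal sketch of how the cross-validation splits could be generated, using NumPy only. The function `k_fold_indices` is a hypothetical helper, and the model fitting is left as a comment because it depends on your setup:

```python
import numpy as np


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k shuffled, non-overlapping folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx


splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in splits:
    # Here one would fit a fresh model on X[train_idx], Y[train_idx]
    # and score it on X[val_idx], Y[val_idx].
    print(len(train_idx), len(val_idx))
```

Comparing the training and validation scores across folds gives a rough idea of whether the model is overfitting or is simply unable to fit the data.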
Third, there are correlations within the sequences. A simple feed-forward network is not well suited to capturing such relations, whereas a recurrent network can capture these dependencies in the data set. Here is a simple example of that. The example is for binary classification, but it can be extended to the multi-class setting, as in your case.
As for the choice of loss function, it is completely problem-specific. You can check this link, which explains when each loss function is helpful.
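To make the comparison concrete for your setup, here is a NumPy sketch of two of the losses you tried, assuming one-hot targets and a softmax output. The Poisson formula follows the Keras definition, mean(y_pred - y_true * log(y_pred + eps)); the cross-entropy here adds eps instead of clipping as Keras does, which is a simplification. The example values are made up.

```python
import numpy as np

EPS = 1e-7  # assumed small constant for numerical stability


def poisson_loss(y_true, y_pred):
    """Per-sample Poisson loss: mean over classes of y_pred - y_true * log(y_pred + eps)."""
    return np.mean(y_pred - y_true * np.log(y_pred + EPS), axis=-1)


def categorical_crossentropy(y_true, y_pred):
    """Per-sample cross-entropy: -sum over classes of y_true * log(y_pred + eps)."""
    return -np.sum(y_true * np.log(y_pred + EPS), axis=-1)


y_true = np.array([[0.0, 1.0, 0.0, 0.0]])  # one-hot target (made-up example)
y_pred = np.array([[0.1, 0.6, 0.2, 0.1]])  # softmax-like output, sums to 1
num_classes = y_true.shape[-1]

p = poisson_loss(y_true, y_pred)
ce = categorical_crossentropy(y_true, y_pred)
# Because y_pred sums to 1, the Poisson loss equals (1 + ce) / num_classes.
print(p, (1.0 + ce) / num_classes)
```

Under these assumptions, the two losses differ only by a constant offset and a scale factor, so switching between 'poisson' and 'categorical_crossentropy' on a softmax output is unlikely to change which classes the model predicts.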