python deep-learning pytorch conv-neural-network bioinformatics

CNN predictions are limited to a narrow range of values

I have trained a CNN model on DNA sequences and their expression values. However, the model's predictions fall within a narrow range of gene expression values, as shown in Figure 1. My questions are:

1) Would the predictions improve if I used different hyperparameters (currently learning rate = 1e-5, epochs = 500)?
2) Would it be wise to train a deeper network, and if so, is there a rule of thumb for how many layers to use?
3) The training data is imbalanced (Figure 2). Would it be recommended to normalize the data before training?

Here is the CNN model, which I adapted from a Medium article:

import torch.nn as nn


class DNA_CNN(nn.Module):
    def __init__(self,
                 seq_len,
                 num_filters=32,
                 kernel_size=3):
        super().__init__()
        self.seq_len = seq_len
        
        self.conv_net = nn.Sequential(
            # 4 is for the 4 nucleotides
            nn.Conv1d(4, num_filters, kernel_size=kernel_size),
            nn.ReLU(inplace=True),
            nn.Flatten(),
            nn.Linear(num_filters*(seq_len-kernel_size+1), 1),
        ) 

    def forward(self, xb):
        # reshape view to batch_size x 4channel x seq_len
        # permute to put channel in correct order
        xb = xb.permute(0,2,1) 
        
        #print(xb.shape)
        out = self.conv_net(xb)
        return out
    

Figure 1

Figure 2.

*Gene expression values are measured in TPM (transcripts per million).

Solution

First, if the log values are roughly normally distributed within, say, [-10, 10], as your graph suggests, consider training on log(X) / 5.0 instead of X. This compresses the information into a small interval with regularly spaced values, which makes the data much easier for the model to fit. A model will struggle to handle values such as 1000 and 0.0001 at the same time unless they are log-transformed first. Remember to invert the transformation (exponentiate after multiplying by 5.0) when you want predictions back on the original scale.
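The transformation described above could look roughly like the sketch below. It assumes the natural logarithm and the divisor 5.0 from the answer; the names `to_model_target`, `from_model_output`, and the pseudocount `EPS` are hypothetical and only there to keep `log(0)` from being evaluated when a gene has zero TPM.

```python
import math

SCALE = 5.0  # divisor suggested in the answer; tune it to your own data
EPS = 1e-6   # hypothetical pseudocount so that log(0) is never evaluated


def to_model_target(tpm):
    """Map a raw TPM value to the scaled log target used for training."""
    return math.log(tpm + EPS) / SCALE


def from_model_output(y):
    """Invert to_model_target: map a model output back to the TPM scale."""
    return math.exp(y * SCALE) - EPS


# Values spanning seven orders of magnitude end up in a small interval.
raw = [0.0001, 1.0, 1000.0]
scaled = [to_model_target(v) for v in raw]
restored = [from_model_output(y) for y in scaled]
```

The same two functions can be applied to a whole tensor with `torch.log` and `torch.exp` before building the training set and after calling the model.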

Second, if the model still predicts nearly the same values, add capacity with one or two more layers. You should also increase the kernel size and consider 1-D pooling, especially for long sequences. Both widen the stretch of the sequence that each unit near the end of the network can see, which gives a better representation of the sequence.
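As a rough illustration of these suggestions, the sketch below adds a second convolution, uses a larger kernel, and inserts 1-D pooling. The class name `DeeperDNA_CNN`, the kernel size of 9, the filter counts, and the global max pooling before the linear layer are all assumptions made for this example, not values taken from the question or tested on its data.

```python
import torch
import torch.nn as nn


class DeeperDNA_CNN(nn.Module):
    def __init__(self, num_filters=32, kernel_size=9):
        super().__init__()
        pad = kernel_size // 2  # keeps the sequence length for odd kernel sizes
        self.features = nn.Sequential(
            # 4 input channels, one per nucleotide
            nn.Conv1d(4, num_filters, kernel_size, padding=pad),
            nn.ReLU(),
            nn.MaxPool1d(2),  # halves the length, widening the receptive field
            nn.Conv1d(num_filters, num_filters * 2, kernel_size, padding=pad),
            nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # global max over the remaining positions
        )
        self.head = nn.Linear(num_filters * 2, 1)

    def forward(self, xb):
        # input is batch_size x seq_len x 4; Conv1d expects channels second
        xb = xb.permute(0, 2, 1)
        out = self.features(xb).squeeze(-1)  # batch_size x (num_filters * 2)
        return self.head(out)


# Shape check on random data standing in for one-hot encoded sequences
model = DeeperDNA_CNN()
dummy = torch.randn(8, 100, 4)
pred = model(dummy)  # one predicted value per sequence
```

Because of the global pooling step, this variant does not need `seq_len` in its constructor. Whether the extra depth actually helps is something to verify on a validation set.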