
CNN weights getting stuck


This is a slightly theoretical question. Below is a graph that plots the loss as the CNN is being trained; the y-axis is MSE and the x-axis is the number of epochs.

[Plot: training loss (MSE) vs. number of epochs]

Description of CNN:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=5, kernel_size=9)
        self.pool1 = nn.MaxPool1d(3)
        self.fc1 = nn.Linear(5 * 30, 200)
        #self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(200, 99)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = x.view(-1, 5 * 30)  # flatten: 5 channels * 30 time steps
        #x = self.dropout(F.relu(self.fc1(x)))
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x


def init_weights(m):
    # Xavier-initialize the linear layers; biases start at a small constant
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
        m.bias.data.fill_(0.01)


net = Net()
net.apply(init_weights)

criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.01)  # anywhere from 0.01 down to 0.0001, depending on the run
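For context, a minimal training step consistent with the definitions above would look something like the sketch below. The batch size and random tensors are hypothetical placeholders, and the input length of 98 is an assumption implied by the 5 * 30 flatten.

# Sketch of a training loop; `inputs`/`targets` are placeholder random data.
inputs = torch.randn(64, 1, 98)   # (batch, channels, length); length 98 implied by the 5 * 30 flatten
targets = torch.randn(64, 99)     # 99 regression targets per sample

for epoch in range(100):
    optimizer.zero_grad()
    outputs = net(inputs)              # shape (64, 99)
    loss = criterion(outputs, targets) # MSE over all 99 outputs
    loss.backward()
    optimizer.step()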

Both the input and the output are arrays of numbers; it is a multi-output regression problem.

This issue, where the loss/weights get stuck in a bad region, happens less often when I use a lower learning rate, but it still happens. In some sense that means the loss surface created by the CNN's parameters is jagged, with many local minima. This could be true because the CNN's inputs are very similar to one another. Would increasing the depth of the CNN, both the conv layers and the fully connected linear layers (e.g. something like the sketch below), help solve this problem, since the loss surface might become smoother? Or is this intuition completely incorrect?

A broader question: when should you be inclined to add more convolutional layers? I know that in practice you should almost never start from scratch and should instead reuse another model's first few layers. However, the inputs I am using are very different from anything I have found online, so I cannot do this.
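Concretely, "more layers" means something like the sketch below: a second conv/pool stage before the fully connected layers. The layer sizes here are assumptions derived from the 5 * 30 flatten (which implies an input length of 98), not a tested configuration.

class DeeperNet(nn.Module):

    def __init__(self):
        super(DeeperNet, self).__init__()
        self.conv1 = nn.Conv1d(1, 5, kernel_size=9)   # length 98 -> 90
        self.pool1 = nn.MaxPool1d(3)                  # 90 -> 30
        self.conv2 = nn.Conv1d(5, 10, kernel_size=3)  # 30 -> 28
        self.pool2 = nn.MaxPool1d(2)                  # 28 -> 14
        self.fc1 = nn.Linear(10 * 14, 200)
        self.fc2 = nn.Linear(200, 99)

    def forward(self, x):
        x = self.pool1(F.relu(self.conv1(x)))
        x = self.pool2(F.relu(self.conv2(x)))
        x = x.view(-1, 10 * 14)
        x = F.relu(self.fc1(x))
        return self.fc2(x)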


Solution

  • Is this a multiclass classification problem? If so, you could try using cross-entropy loss, and maybe a softmax layer before the output. I'm not sure, because I don't know what the model's input and output are.
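For reference, a minimal sketch of that classification setup is below, with a hypothetical batch and label tensor. One detail worth noting: PyTorch's nn.CrossEntropyLoss applies log-softmax internally, so the model should output raw logits rather than having an explicit softmax layer.

# Hypothetical classification setup, for illustration only.
criterion = nn.CrossEntropyLoss()       # applies log-softmax internally
logits = net(torch.randn(64, 1, 98))    # raw scores, shape (64, 99); no softmax layer needed
labels = torch.randint(0, 99, (64,))    # hypothetical integer class labels in [0, 99)
loss = criterion(logits, labels)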