Tags: computer-vision, neural-network, deep-learning, caffe, face-recognition

Unable to train/fine-tune with PReLU in caffe


I am working on face recognition with a deep neural network. I am using the CASIA-WebFace database of 10,575 classes to train a deep CNN (the one used by CASIA, see the paper for details) with 10 convolution, 5 pooling and 1 fully connected layer. It uses the "ReLU" function for activation. I was able to train it successfully with "ReLU" and obtained the desired performance.

My problem is that I am unable to train/fine-tune the same CNN with the "PReLU" activation. At first I thought that simply replacing "ReLU" with "PReLU" would do the job. However, neither fine-tuning (from the caffemodel that was learned with "ReLU") nor training from scratch worked.
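Concretely, the "simple replace" I tried amounts to rewriting the layer types in the train prototxt, e.g. with a quick script like the sketch below (using Caffe's Python protobuf bindings; the file names are placeholders for my own files):

    # Sketch: replace every "ReLU" layer with "PReLU" in a train/val prototxt.
    from caffe.proto import caffe_pb2
    from google.protobuf import text_format

    net = caffe_pb2.NetParameter()
    with open('casia_train_val.prototxt') as f:        # placeholder file name
        text_format.Merge(f.read(), net)

    for layer in net.layer:
        if layer.type == 'ReLU':
            layer.type = 'PReLU'                       # name/bottom/top stay the same

    with open('casia_train_val_prelu.prototxt', 'w') as f:
        f.write(text_format.MessageToString(net))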

In order to simplify the learning problem, I reduced the training dataset significantly, to only 50 classes. Even then, the CNN was unable to learn with "PReLU", whereas it was able to learn with "ReLU".

To verify that my Caffe installation works fine with "PReLU", I ran simple networks (with both "ReLU" and "PReLU") on the CIFAR-10 data, and both worked.

I would like to know from the community whether anyone has made similar observations, or can suggest a way to overcome this problem.


Solution

  • The main difference between the "ReLU" and "PReLU" activations is that the latter has a non-zero slope for negative inputs, and that this slope can be learned from the data. These properties have been observed to make training more robust to the random initialization of the weights.
    I have used "PReLU" activations for fine-tuning nets that were originally trained with "ReLU"s and experienced faster and more robust convergence.

    My suggestion is to replace "ReLU" with the following configuration

    layer {
      name: "prelu"
      type: "PReLU"
      bottom: "my_bottom"
      top: "my_bottom" # you can make it "in-place" to save memory
      param { lr_mult: 1 decay_mult: 0 }
      prelu_param {
        filler { type: "constant" value: 0 }
        channel_shared: false
      }
    }
    

    Note that by initializing the negative slope to 0, the "PReLU" activations are in fact identical to "ReLU", so you start the fine-tuning from exactly the same spot as your original net.
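    As a quick sanity check of that equivalence (plain NumPy, independent of Caffe): PReLU computes max(0, x) + a * min(0, x), which for a = 0 reduces exactly to ReLU.

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def prelu(x, a):
        # identity for x > 0, slope a for x <= 0
        return np.maximum(0.0, x) + a * np.minimum(0.0, x)

    x = np.random.randn(1000)
    assert np.allclose(prelu(x, a=0.0), relu(x))  # slope 0 -> identical to ReLU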

    Also note that I explicitly set the learning-rate and weight-decay multipliers (lr_mult: 1 and decay_mult: 0, respectively) -- you might need to tweak these params a bit, though I believe setting decay_mult to anything other than zero is not wise.
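    If it helps, here is a short pycaffe sketch (file names are placeholders, and it assumes the layer is named "prelu" as in the snippet above) to confirm the starting point before launching the fine-tune. Caffe copies weights by layer name, so layers absent from the caffemodel -- the new "PReLU" layers -- keep their filler initialization, i.e. slope 0:

    import caffe

    # Placeholders: your deploy-style prototxt with PReLU layers,
    # and the caffemodel that was trained with ReLU.
    net = caffe.Net('casia_prelu_deploy.prototxt',
                    'casia_relu_iter_XXXX.caffemodel',
                    caffe.TEST)

    # One blob per PReLU layer: the learnable negative slopes.
    print(net.params['prelu'][0].data)  # expect all zeros before fine-tuning

    From there you can fine-tune as usual with "caffe train --solver=your_solver.prototxt --weights=casia_relu_iter_XXXX.caffemodel".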