Search code examples
tensorflowtensorflow-gradient

How do I calculate subgradients in TensorFlow?


Does the automatic differentiation procedure in TensorFlow compute subgradient whenever needed? If there are many subgradients then which one will be chosen as output?

I am trying to implement the paper in the link https://www.aclweb.org/anthology/P13-1045 which uses recursive neural networks to perform efficient language parsing. The objective function uses hinge loss function to pick the optimal output vectors, which makes the function not differentiable. I used TensorFlow (v1.12) in eager mode to program the model and used the automatic differentiation to compute the gradients. After every batch, I could see the gradient values changing and the accuracy is slightly improved. After a while, it decreases and this process continues. The model does not converge at all for all the hyper-parameter configurations.

Mini batch size : 256, 512, 1024; Regularization parameters - 0.1, 0.01, 0.001; Learning rate - 0.1, 0.01, 0.001; Optimization function - gradient descent, adagrad, adam;

In the paper, they have described how to find subgradient for the optimum function in a very abstract manner, which I have not understood yet. I was of the opinion at the beginning that automatic gradient computation calculates the subgradient. But at this moment, I am starting to doubt so because that seems to be the only variable missing.


Solution

  • Unfortunately, Tensorflow does not computes subgradients, only gradients. As explained here How does tensorflow handle non differentiable nodes during gradient calculation? . To summarize, when computing a partial derivative, if there is a problem of differentiability, Tensorflow simply puts this derivative to be zero.

    As for you having trouble training your model, there are no general rules saying how to tune the hyperparameters, thus, I would suggest to do a grid search on the learning rates (on a few epochs) to find a good initial learning rate which provide good results for one of the optimization algorithms. Usually, ADAM or SGD with momentum provide satisfying results when choosing a right initial learning rate.