I'm in the process of rewriting the TRACX2 model, a variant of a recurrent neural network used to train encodings for word segmentation from continuous speech or text. The author of the original code wrote the network by hand in NumPy, whereas I want to reimplement it in PyTorch. However, they apply something they call a "temperature" and a "Fahlman offset" to the derivative of tanh(x), one of their activation functions, during backpropagation.
This clearly isn't the actual derivative of tanh(x), but it is the derivative they use instead. How would I go about implementing this modification in PyTorch?
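Schematically, the modified derivative looks something like this (not their exact code, just the general shape: the true derivative scaled by the temperature, plus a constant offset so the gradient never falls all the way to zero at saturation):

import numpy as np

temperature = 0.3    # example value
fahlmanOffset = 0.1  # Fahlman-style flat-spot elimination constant

def modified_tanh_deriv(x):
    # True derivative of tanh(x) is 1 - tanh(x)**2; this version
    # scales it by the temperature and adds a constant offset.
    return temperature * (1.0 - np.tanh(x) ** 2) + fahlmanOffset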
Basically, you add a backward hook like so:
import torch
import torch.nn as nn

temperature = 0.3
fahlmanOffset = 0.1

a = torch.randn(2, 2, requires_grad=True)
m = nn.Linear(2, 1)

m(a).mean().backward()
print(a.grad)
# shows a 2x2 tensor of non-zero values

def hook(module, grad_input, grad_output):
    # grad_input holds the gradients w.r.t. the module's inputs;
    # return a tuple of the same length with the modified gradients
    return tuple(
        None if g is None else g * temperature + fahlmanOffset
        for g in grad_input
    )

m.register_full_backward_hook(hook)

a.grad.zero_()
m(a).mean().backward()
print(a.grad)
# shows a 2x2 tensor with modified gradients
(courtesy of this answer)
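Since the question is about tanh specifically, the same kind of hook can be registered on an nn.Tanh module instead of a Linear layer. A minimal sketch, assuming the same temperature and fahlmanOffset values as above:

import torch
import torch.nn as nn

temperature = 0.3
fahlmanOffset = 0.1

tanh = nn.Tanh()

def tanh_hook(module, grad_input, grad_output):
    # Scale and offset the gradient flowing back through the tanh.
    return tuple(
        None if g is None else g * temperature + fahlmanOffset
        for g in grad_input
    )

tanh.register_full_backward_hook(tanh_hook)

x = torch.randn(4, requires_grad=True)
tanh(x).sum().backward()
print(x.grad)
# temperature * (1 - tanh(x)**2) + fahlmanOffset, since the upstream gradient here is all ones

Keep in mind that a backward hook modifies the gradient after the chain rule has already been applied, so the offset term is not multiplied by the upstream gradient the way a true replacement derivative would be. If you need an exact drop-in replacement for tanh's derivative, a custom torch.autograd.Function with its own backward() is probably the closer match.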