automatic-differentiation, diffsharp

Understanding higher-order automatic differentiation


Having recently finished my own basic reverse-mode AD for machine learning purposes, I find myself wanting to learn more about the field, but I've hit a wall with higher-order methods.

Basic reverse AD is beautifully simple and easy to understand, but the more advanced material is too abstract and too technical, and I have not been able to find any good explanations of it on the Internet (in fact, it took me quite a while to realize that basic reverse AD even exists).

Basically, I understand how to take second derivatives in the context of calculus, but I do not understand how to transform a reverse AD graph to get second-order derivatives.
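
At the level of a library call, I can picture it as simply differentiating the gradient program again. Here is a minimal sketch of that picture in JAX, which I am using purely as stand-in notation (the function f and everything else below is my own example, not DiffSharp code); what I don't see is what this corresponds to as a transformation of the reverse AD graph itself.

```python
import jax
import jax.numpy as jnp

def f(x):
    # scalar -> scalar example function
    return jnp.sin(x) * x**2

# The gradient produced by reverse AD is itself an ordinary differentiable
# program, so reverse AD can be applied to it a second time.
df = jax.grad(f)    # f'(x)  = cos(x)*x**2 + 2*x*sin(x)
d2f = jax.grad(df)  # f''(x) = 2*sin(x) + 4*x*cos(x) - x**2*sin(x)

print(d2f(1.5))
```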

In an algorithm like edge_pushing, just what do those dashed connections mean?

I've investigated the DiffSharp library and noticed that it uses something like forward-on-reverse differentiation for calculating the Hessian. Running it through the debugger, I've seen that it does in fact mix forward and reverse steps in a single run. What are the principles behind that mechanism?
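
Concretely, the behavior I observed looks like the following pattern, which I sketch here in JAX purely as stand-in notation (not DiffSharp's actual code): build the gradient with a reverse sweep, then apply forward mode to that gradient function, one input direction per sweep, getting one Hessian column each time.

```python
import jax
import jax.numpy as jnp

def f(x):
    # R^n -> R example function
    return jnp.sum(jnp.sin(x) * x**2)

# Forward-on-reverse: forward mode applied over the reverse-mode gradient.
# Each forward sweep through the gradient program yields one Hessian column.
hessian = jax.jacfwd(jax.grad(f))

x = jnp.array([0.5, 1.0, 2.0])
print(hessian(x))   # 3x3 Hessian of f at x
```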

DiffSharp uses the Jacobian-vector product to calculate the Hessian, for each variable, which is an R^m -> R^n mapping. How is it possible to get that from the original graph? Reverse AD is an R -> R^n mapping; where do the extra dimensions come from?
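
To make my confusion concrete, here is the kind of thing I mean, again sketched in JAX only as stand-in notation (f, hvp, x, and v are my own example names, not DiffSharp's API): a single Hessian-vector product obtained by pushing one tangent vector through the gradient program with forward mode, without ever materializing a full Jacobian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # R^n -> R example function
    return jnp.sum(jnp.sin(x) * x**2)

def hvp(x, v):
    # jax.jvp of the gradient function returns (grad_f(x), H(x) @ v):
    # one forward sweep through the reverse-mode gradient program.
    return jax.jvp(jax.grad(f), (x,), (v,))[1]

x = jnp.array([0.5, 1.0, 2.0])
v = jnp.array([1.0, 0.0, 0.0])
print(hvp(x, v))    # first column of the Hessian of f at x
```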

Lastly, how does nested AD work?
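
By nested AD I mean differentiating code whose body already calls the differentiator, for example the following JAX sketch of my own (used only to pin down the question); the well-known difficulty, as I understand it, is keeping the inner and outer perturbations from being confused with each other.

```python
import jax

def inner(x, y):
    return x * y**2

def g(x):
    # The inner derivative is with respect to y only; from the inner AD's
    # point of view, x is a constant even though the outer AD is tracking it.
    dy = jax.grad(inner, argnums=1)(x, 2.0)   # d/dy [x*y^2] at y=2  ->  4*x
    return x * dy                             # 4*x^2

print(jax.grad(g)(3.0))                       # d/dx [4*x^2] at x=3  ->  24.0
```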


Solution

  • I wrote a tutorial for AD that, near the end, briefly shows how to do forward along with reverse here. I also wrote an entire library for basic AD on the GPU that is linked at the same site.

    Still not sure about edge_pushing, but I do not think it matters much for neural nets at any rate.