In the OpenAI paper "Evolution Strategies as a Scalable Alternative to Reinforcement Learning", how is the equation on page 3 derived?
It isn't "derived" in the sense you might expect: the equation does not follow as a natural progression from the previous equation presented in the paper.
The formula simply states how the authors chose to apply stochastic gradient ascent; it is a mathematical description of the algorithm they used.
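For reference, I believe the equation being asked about is the parameter update from Section 2 of the paper,

$$\theta_{t+1} \leftarrow \theta_t + \alpha \, \frac{1}{n\sigma} \sum_{i=1}^{n} F_i \, \epsilon_i,$$

where $F_i = F(\theta_t + \sigma\epsilon_i)$ is the episodic return of the $i$-th perturbed policy and $\epsilon_i \sim \mathcal{N}(0, I)$.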
Right below that equation, they explain how it works:
The resulting algorithm repeatedly executes two phases: 1) Stochastically perturbing the parameters of the policy and evaluating the resulting parameters by running an episode in the environment, and 2) Combining the results of these episodes, calculating a stochastic gradient estimate, and updating the parameters.
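To make those two phases concrete, here is a minimal NumPy sketch; the toy objective `f` and the values of `n`, `sigma`, and `alpha` are illustrative choices of mine, not the paper's setup.

```python
import numpy as np

def f(theta):
    # Stand-in for F(theta): in the paper this would be the total return
    # of an episode run with the perturbed policy parameters.
    return -np.sum(theta ** 2)

n = 50          # number of perturbations per update (population size)
sigma = 0.1     # standard deviation of the parameter noise
alpha = 0.02    # learning rate
theta = np.random.randn(10)  # policy parameters

for t in range(200):
    # Phase 1: stochastically perturb the parameters and evaluate each
    # perturbed policy by "running an episode" (here, calling f).
    eps = np.random.randn(n, theta.size)
    returns = np.array([f(theta + sigma * e) for e in eps])

    # Phase 2: combine the episode results into a stochastic gradient
    # estimate and update the parameters with the update rule above.
    grad_estimate = (1.0 / (n * sigma)) * eps.T @ returns
    theta = theta + alpha * grad_estimate

print("final f(theta):", f(theta))
```

Everything here is just the update rule written out in code; the paper's contribution is largely about distributing these evaluations across many workers efficiently.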
It might help to go back to the beginning of the paper and read it slowly and carefully. If you come across anything that doesn't make sense, look it up, and don't continue until you understand what the authors are trying to tell you.