Tags: tensorflow, keras, pytorch, reinforcement-learning, q-learning

Are there benefits to having Actor and Critic use significantly different models?


In Actor-Critic methods the Actor and Critic are assigned two complementary, but different goals. I'm trying to understand whether the differences between these goals (updating a policy and updating a value function) are large enough to warrant different models for the Actor and Critic, or whether they are of similar enough complexity that the same model should be reused for simplicity. I realize this could be very situational, but I'm not sure in what way. For example, does the balance shift as the model complexity grows?

Please let me know if there are any rules of thumb for this, or if you know of a specific publication that addresses the issue.


Solution

  • The empirical results suggest the exact opposite: it is important to have the same network do both (up to some final layer/head); see the sketch after the equation below. The main reason is that learning the value network (critic) provides a signal for shaping the representation of the policy (actor) that would otherwise be nearly impossible to get.

    In fact, if you think about it, these are extremely similar goals, since for the optimal deterministic policy

    pi(s) = arg max_a Q(s, a) = arg max_a V(T(s, a))
    

    where T is the transition dynamics.
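
    Here is a minimal sketch of what "same network up to a final layer/head" can look like, using PyTorch (one of the question's tags); the class name, layer sizes, and dimensions are illustrative assumptions, not a prescribed architecture:

    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            # Shared trunk: gradients from the critic's value loss also shape
            # these weights, which is the representation-learning benefit
            # described above.
            self.trunk = nn.Sequential(
                nn.Linear(obs_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.policy_head = nn.Linear(hidden, n_actions)  # actor head
            self.value_head = nn.Linear(hidden, 1)           # critic head

        def forward(self, obs):
            features = self.trunk(obs)
            logits = self.policy_head(features)  # action preferences
            value = self.value_head(features)    # state-value estimate V(s)
            return logits, value

    # Example usage with hypothetical dimensions:
    net = ActorCritic(obs_dim=4, n_actions=2)
    obs = torch.randn(1, 4)
    logits, value = net(obs)
    action = torch.distributions.Categorical(logits=logits).sample()

    Only the two small heads differ; everything below them is trained by both the policy and the value objectives.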