I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation:
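Something of this standard shape, where the next state follows deterministically as $s_{t+1} = f(s_t, a_t)$ (the exact notation here is my assumption):

$$V(s_t) = \max_{a_t} \big[ r(s_t, a_t) + \gamma \, V(s_{t+1}) \big]$$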
When I introduce stochasticity from the MDP, I get 2.6a:
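That is, the max over actions now wraps an expectation over next states, with transition probabilities $p(s_{t+1} \mid s_t, a_t)$ supplied by the MDP (again, the notation is my assumption):

$$V(s_t) = \max_{a_t} \Big[ r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \, V(s_{t+1}) \Big]$$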
My question: is this assumption correct? I have seen implementations of 2.6a without a policy sign on the state-value function, but to me this does not make sense, because I am using the probabilities of the different next states I could end up in, which is the same as saying a policy, I think. And if 2.6a is correct, can I then assume the rest (2.6b and 2.6c) as well? Because then I would like to write the action-value function like this:
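Roughly like this, by analogy with 2.6a (my sketch, not a quote from the source):

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \, \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$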
The reason I am doing it this way is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.
I hope someone out there can help me with this one!
Best regards, Søren Koch
No, the value function $V(s_t)$ does not depend on the policy. You see in the equation that it is defined in terms of an action $a_t$ that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state values, but this is still independent of any policy. The sum only runs over the different possible future states, but every term involves exactly the same (policy-independent) action $a_t$. The only reason you have these probabilities is that, in the nondeterministic case, a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
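To make that concrete, here is a minimal sketch with made-up numbers (the MDP itself is hypothetical; only the structure matters): the transition probabilities live in the environment model `P`, the backup takes a max over actions, and no policy appears anywhere.

```python
import numpy as np

# Toy MDP, purely illustrative: 2 states, 2 actions.
# P[s, a, s2] = probability of landing in state s2 after
# taking action a in state s (environment stochasticity).
P = np.array([
    [[0.8, 0.2],    # state 0, action 0
     [0.1, 0.9]],   # state 0, action 1
    [[0.5, 0.5],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])
R = np.array([[1.0, 0.0],   # R[s, a] = immediate reward
              [0.0, 2.0]])
gamma = 0.9
V = np.zeros(2)

# Bellman optimality backup: max over actions of
# (reward + discounted expectation over next states).
# The probabilities come from P, i.e. from the environment;
# there is no policy term anywhere in this computation.
for _ in range(100):
    V = np.array([
        max(R[s, a] + gamma * P[s, a] @ V for a in range(2))
        for s in range(2)
    ])
print(V)  # converged optimal state values
```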
There does also exist such a thing as a value function for policies, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent value function would replace the $\max_{a_t}$ with a sum over all actions $a$, and include inside that sum the probability $\pi(s_t, a)$ of the policy $\pi$ selecting action $a$ in state $s_t$.
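Written out, that policy-dependent value function would look like this (my rendering of the description above, keeping the $\pi(s_t, a)$ notation):

$$V^{\pi}(s_t) = \sum_{a} \pi(s_t, a) \Big[ r(s_t, a) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a) \, V^{\pi}(s_{t+1}) \Big]$$

Note that both the policy probabilities $\pi(s_t, a)$ and the transition probabilities $p(s_{t+1} \mid s_t, a)$ appear here, and they are different objects: the former are chosen by the agent, the latter are fixed by the environment.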