I am just getting started with deep reinforcement learning and I am trying to grasp this concept.
I have this deterministic Bellman equation:
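Something of this standard shape, where the next state follows deterministically as $s_{t+1} = f(s_t, a_t)$ (the exact notation here is my assumption):

$$V(s_t) = \max_{a_t} \big[ r(s_t, a_t) + \gamma \, V(s_{t+1}) \big]$$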
When I introduce stochasticity from the MDP, I get 2.6a:
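That is, the max over actions now wraps an expectation over next states, with transition probabilities $p(s_{t+1} \mid s_t, a_t)$ supplied by the MDP (again, the notation is my assumption):

$$V(s_t) = \max_{a_t} \Big[ r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \, V(s_{t+1}) \Big]$$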
My question: is this assumption correct? I have seen implementations of 2.6a without a policy sign on the state-value function, but to me this does not make sense, because I am using the probabilities of the different next states I could end up in, which is the same as saying a policy, I think. And if 2.6a is correct, can I then assume the rest (2.6b and 2.6c) as well? Because then I would like to write the action-value function like this:
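Roughly like this, by analogy with 2.6a (my sketch, not a quote from the source):

$$Q(s_t, a_t) = r(s_t, a_t) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \, \max_{a_{t+1}} Q(s_{t+1}, a_{t+1})$$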
The reason I am doing it this way is that I would like to work my way from a deterministic point of view to a non-deterministic point of view.
I hope someone out there can help me with this one!
Best regards, Søren Koch
No, the value function $V(s_t)$ does not depend on the policy. You see in the equation that it is defined in terms of an action $a_t$ that maximizes a quantity, so it is not defined in terms of actions as selected by any policy.
In the nondeterministic / stochastic case, you will have that sum over probabilities multiplied by state values, but this is still independent of any policy. The sum only runs over the different possible future states, but every term involves exactly the same (policy-independent) action $a_t$. The only reason you have these probabilities is that, in the nondeterministic case, a specific action in a specific state can lead to one of multiple different possible states. This is not due to policies, but due to stochasticity in the environment itself.
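To make that concrete, here is a minimal sketch with made-up numbers (the MDP itself is hypothetical; only the structure matters): the transition probabilities live in the environment model `P`, the backup takes a max over actions, and no policy appears anywhere.

```python
import numpy as np

# Toy MDP, purely illustrative: 2 states, 2 actions.
# P[s, a, s2] = probability of landing in state s2 after
# taking action a in state s (environment stochasticity).
P = np.array([
    [[0.8, 0.2],    # state 0, action 0
     [0.1, 0.9]],   # state 0, action 1
    [[0.5, 0.5],    # state 1, action 0
     [0.0, 1.0]],   # state 1, action 1
])
R = np.array([[1.0, 0.0],   # R[s, a] = immediate reward
              [0.0, 2.0]])
gamma = 0.9
V = np.zeros(2)

# Bellman optimality backup: max over actions of
# (reward + discounted expectation over next states).
# The probabilities come from P, i.e. from the environment;
# there is no policy term anywhere in this computation.
for _ in range(100):
    V = np.array([
        max(R[s, a] + gamma * P[s, a] @ V for a in range(2))
        for s in range(2)
    ])
print(V)  # converged optimal state values
```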
There does also exist such a thing as a value function for policies, and when talking about that, a symbol for the policy should be included. But this is typically not what is meant by just "value function", and it also does not match the equation you have shown us. A policy-dependent value function would replace the $\max_{a_t}$ with a sum over all actions $a$, and include inside that sum the probability $\pi(s_t, a)$ of the policy $\pi$ selecting action $a$ in state $s_t$.
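Written out, that policy-dependent value function would look like this (my rendering of the description above, keeping the $\pi(s_t, a)$ notation):

$$V^{\pi}(s_t) = \sum_{a} \pi(s_t, a) \Big[ r(s_t, a) + \gamma \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a) \, V^{\pi}(s_{t+1}) \Big]$$

Note that both the policy probabilities $\pi(s_t, a)$ and the transition probabilities $p(s_{t+1} \mid s_t, a)$ appear here, and they are different objects: the former are chosen by the agent, the latter are fixed by the environment.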