probability, environment, policy, openai-gym

In OpenAI's gym, what does the 'prob' return from an environment relate to?


Going through David Silver's lectures and now working through some exercises to cement the knowledge, I have found that I do not understand what the returned probability actually refers to. In policy evaluation, we have

$v_{k+1}(s) = \sum_{a\in A} \pi(a|s)(R_s^a + \gamma\sum_{s'\in S}P^a_{ss'}v_k(s'))$

And I have successfully implemented this in Python for the gridworld environment:


import numpy as np

def policy_eval(policy, env, discount_factor=1.0, theta=0.00001):
    """Iteratively evaluate `policy` until the value function changes by less than theta."""
    V = np.zeros(env.nS)
    while True:
        delta = 0
        for state in range(env.nS):
            v = 0
            for action in range(env.nA):
                # env.P[state][action] lists every possible outcome of taking
                # this action in this state as a (prob, next_state, reward, done) tuple
                for prob, next_state, reward, done in env.P[state][action]:
                    v += policy[state][action] * prob * (reward + discount_factor * V[next_state])
            delta = max(delta, abs(v - V[state]))
            V[state] = v

        if delta < theta:
            break
    return np.array(V)

I know policy[state][action] is the probability of taking that action in that state, and reward is the reward for taking that action in that state; the other two are self-explanatory. I do not see how prob fits in or what it even does/refers to.
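For reference, here is a minimal, self-contained sketch of how the function can be run end to end. It uses FrozenLake-v1 rather than the gridworld purely for illustration, and assumes a gym version where the unwrapped toy-text environments expose the transition table P; the small adapter class is my own and not part of gym.

import gym
import numpy as np

class TabularEnv:
    """Tiny adapter exposing nS, nA and P the way policy_eval expects."""
    def __init__(self, raw_env):
        self.nS = raw_env.observation_space.n
        self.nA = raw_env.action_space.n
        self.P = raw_env.P

env = TabularEnv(gym.make("FrozenLake-v1").unwrapped)

# Each entry of P[state][action] is a (prob, next_state, reward, done) tuple
# describing one possible outcome of taking that action in that state.
print(env.P[0][0])

# Evaluate a uniform random policy.
random_policy = np.ones([env.nS, env.nA]) / env.nA
print(policy_eval(random_policy, env))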


Solution

  • After some more playing around with gym and value iteration, I have found that prob is the environment's transition probability: the chance that taking a given action lands us in that particular next state, i.e. the $P^a_{ss'}$ term in the equation above. It captures the part of the outcome we do not control. The example that made it click for me is the gambler's problem. A gambler starts with some amount of money and wins if he reaches $100; otherwise he must place a bet between 0 and the amount of money he has. If the game being played is a coin flip, then prob is 0.5 for each outcome: 50% of the time we double our bet, and 50% of the time we lose everything we bet. Thus, we have no control over what happens after an action (where an action is placing a bet); a rough sketch of such a transition table is shown after this answer.

    Hope this helps someone else who is encountering the same dilemma.
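As mentioned above, here is a rough sketch of what a transition table for the coin-flip gambler's problem could look like. This is my own construction for illustration (the bet range and the reward of 1 only on reaching $100 follow the usual formulation); it is not something gym provides out of the box.

GOAL = 100       # the gambler wins on reaching $100
P_HEADS = 0.5    # fair coin

# P[state][bet] -> list of (prob, next_state, reward, done) tuples,
# mirroring the structure of env.P in the gridworld code above.
P = {}
for state in range(1, GOAL):
    P[state] = {}
    for bet in range(1, min(state, GOAL - state) + 1):
        win, lose = state + bet, state - bet
        P[state][bet] = [
            # The coin, not the agent, decides which outcome happens.
            (P_HEADS, win, 1.0 if win == GOAL else 0.0, win == GOAL),
            (1 - P_HEADS, lose, 0.0, lose == 0),
        ]

print(P[50][25])  # two equally likely outcomes: capital goes to 75 or down to 25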