I'm solving an MIT lab on Reinforcement Learning and am stuck on the reward function. The particular code block is this: https://colab.research.google.com/github/aamini/introtodeeplearning/blob/master/lab3/solutions/RL_Solution.ipynb#scrollTo=5_Q2OFYtQ32X&line=19&uniqifier=1
A simpler version of the relevant code is:
import numpy as np

rewards = [0., 0., 0., 0., 0., 1.]
discounted_rewards = np.zeros_like(rewards)
R = 0
for t in reversed(range(0, len(rewards))):
    # update the total discounted reward
    R = R * .95 + rewards[t]
    discounted_rewards[t] = R
discounted_rewards
Which gives output as:
array([0.77378094, 0.81450625, 0.857375, 0.9025, 0.95, 1.])
The provided explanation is that we want to encourage having rewards sooner rather than later. How does using reversed in the for loop help with that?
reversed is necessary so that each reward is multiplied by the discount factor x times, where x is the number of timesteps between that reward and the present. Since it's a cumulative reward, each step adds the current reward to the already-discounted sum of the rewards that come after it, which wouldn't be possible without iterating in reverse.
With the reverse, the last reward is the first one added to R; then, as the loop continues, it is multiplied by 0.95 once for every timestep that precedes the reward event.
Unrolled, this is what the loop accumulates into discounted_rewards[0] (the final value of R):
R = 0
R += 0.95 ** 5 * 1
R += 0.95 ** 4 * 0
R += 0.95 ** 3 * 0
R += 0.95 ** 2 * 0
R += 0.95 ** 1 * 0
R += 0
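If it helps, here is a minimal sketch (the names gamma and closed_form are mine, not from the lab) that runs the reversed loop next to the equivalent closed-form sum, so you can check that each entry is the sum of all future rewards, each discounted once per timestep of delay:

import numpy as np

# Same loop as in the question; gamma = 0.95 is the discount factor.
rewards = np.array([0., 0., 0., 0., 0., 1.])
gamma = 0.95

# Reversed accumulation: R always holds the discounted sum of everything from t onward.
discounted = np.zeros_like(rewards)
R = 0.0
for t in reversed(range(len(rewards))):
    R = R * gamma + rewards[t]
    discounted[t] = R

# Equivalent closed form: discounted[t] = sum over k >= t of gamma**(k - t) * rewards[k]
closed_form = np.array([
    sum(gamma ** (k - t) * rewards[k] for k in range(t, len(rewards)))
    for t in range(len(rewards))
])

print(discounted)                             # [0.77378094 0.81450625 0.857375 0.9025 0.95 1.]
print(np.allclose(discounted, closed_form))   # True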
Edit:
The output you get is the cumulative discounted reward. The first index in your output list means that at that timestep your agent has a cumulative discounted reward of 0.7737 for the following state-action pairs. The further you go into the future (the higher the list index), the higher your discounted reward, since you're approaching the final reward of 1 (winning the game).
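To connect this back to "rewards sooner rather than later", here is a small sketch (the helper name discount is mine) comparing the same reward delivered late versus early; from the agent's point of view at t = 0, the earlier reward is worth more:

import numpy as np

def discount(rewards, gamma=0.95):
    # Cumulative discounted return at every timestep, using the same reversed loop.
    out = np.zeros(len(rewards))
    R = 0.0
    for t in reversed(range(len(rewards))):
        R = R * gamma + rewards[t]
        out[t] = R
    return out

late  = [0., 0., 0., 0., 0., 1.]   # reward arrives at the last timestep
early = [1., 0., 0., 0., 0., 0.]   # same reward, but at the first timestep

print(discount(late)[0])    # ~0.7738 -> a delayed reward is worth less now
print(discount(early)[0])   # 1.0     -> an immediate reward keeps its full value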