
Confused about Rewards in David Silver Lecture 2


While watching David Silver's Reinforcement Learning course on YouTube (along with the slides for Lecture 2, MDP), I found "Reward" and "Value Function" really confusing.

  1. I tried to understand the "given rewards" marked on the slide (P11), but I cannot figure out why they are set the way they are, e.g. "Class 1: R = -2" but "Pub: R = +1".

    Why the negative reward for Class and the positive reward for Pub? And why the different values?

  2. How do you calculate the return (the cumulative discounted reward) with the discount factor? (P17 and P18)

I think my lack of intuition for Reinforcement Learning is the main reason I've run into this kind of problem...

So, I'd really appreciate it if someone could give me a little hint.


Solution

  • You usually set the reward and the discount so that RL drives the agent to solve a task. In the student example the goal is to pass the exam. The student can spend his time attending class, sleeping, on Facebook, or at the pub. Attending class is "boring", so the student doesn't see any immediate benefit in doing it; hence the negative reward. Going to the pub, on the contrary, is fun and gives a positive reward. However, only by attending all 3 classes can the student pass the exam and get the big final reward.

    Now the question is: how much does the student value immediate vs. future rewards? The discount factor tells you that: a small discount gives more importance to immediate rewards, because future rewards just "fade" in the long run. With a small discount the student may prefer to always go to the pub or to sleep. With a discount close to 0, all rewards get close to 0 after just one step, so in each state the student will try to maximize the immediate reward, because after that "nothing else matters". The short sketch below makes this concrete.
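
    To answer question 2 concretely, here is a minimal Python sketch of the return G_t = R_{t+1} + γ·R_{t+2} + γ²·R_{t+3} + ..., which is the quantity P17 and P18 work with. The trajectory Class 1 → Class 2 → Class 3 → Pass and the γ values are my own picks for illustration; the rewards of -2 per class and +10 for passing come from the slide:

    ```python
    # Discounted return: G = sum over k of gamma^k * rewards[k].
    # Assumed trajectory: Class 1 -> Class 2 -> Class 3 -> Pass,
    # with rewards -2, -2, -2, +10 (values from the slide).
    def discounted_return(rewards, gamma):
        return sum(gamma ** k * r for k, r in enumerate(rewards))

    rewards = [-2, -2, -2, +10]
    for gamma in (0.0, 0.5, 0.9, 1.0):
        print(f"gamma = {gamma}: G = {discounted_return(rewards, gamma):+.3f}")

    # gamma = 0.0: G = -2.000  (only the immediate reward matters)
    # gamma = 0.5: G = -2.250  (the +10 has mostly faded away)
    # gamma = 0.9: G = +1.870  (passing the exam now pays off)
    # gamma = 1.0: G = +4.000  (no fading at all)
    ```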

    On the contrary, a high discount (1 at most) gives more weight to long-term rewards: in that case the optimal student will attend all the classes and pass the exam.

    Choosing the discount can be tricky, especially if there is no terminal state (here "Sleep" is terminal), because with a discount of 1 the agent may ignore the number of steps it takes to reach the highest reward. For instance, if classes gave a reward of -1 instead of -2, the agent would be indifferent between going straight to the exam and alternating between "class" and "pub" forever before eventually passing it: with a discount of 1 the rewards never fade, so even after 10 years the student would still get the full +10 for passing the exam (see the sketch just below).
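
    A quick sketch of that point. The -1 class reward is the hypothetical variant from the paragraph above, and the 1000 detours are an arbitrary choice:

    ```python
    # With gamma = 1, a class/pub cycle that nets 0 makes procrastination free:
    # the undiscounted return is the same no matter how long the agent delays.
    direct = [-1, -1, -1, +10]                # straight through the classes
    procrastinate = [-1, +1] * 1000 + direct  # 1000 class/pub detours first
    print(sum(direct), sum(procrastinate))    # 7 7 -- identical returns
    ```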

    Think also of a virtual agent that has to reach a goal position. With a discount of 1, the agent would not learn to reach it in the fewest possible steps: as long as it gets there eventually, it is all the same to it.

    Besides that, there is also a numerical problem with a discount of 1. Since the goal is to maximize the cumulative sum of discounted rewards, if the rewards are not discounted (and the horizon is infinite) the sum will not converge, as the last sketch below illustrates.
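
    Finally, a small numerical sketch of the convergence issue, assuming a toy setup of my own with a constant reward of +1 per step: with γ < 1 the partial sums approach the geometric-series limit 1/(1 - γ), while with γ = 1 they grow without bound.

    ```python
    # Partial sums of sum(gamma^k * 1): bounded for gamma < 1, divergent for gamma = 1.
    for gamma in (0.9, 1.0):
        total = 0.0
        for k in range(10_000):
            total += gamma ** k * 1.0
        print(f"gamma = {gamma}: partial sum after 10000 steps = {total:.1f}")

    # gamma = 0.9: partial sum after 10000 steps = 10.0     (converges to 1/(1 - 0.9))
    # gamma = 1.0: partial sum after 10000 steps = 10000.0  (grows without bound)
    ```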