Tags: machine-learning, policy, agent, reinforcement-learning, q-learning

Reinforcement Learning - How does an Agent know which action to pick?


I'm trying to understand Q-Learning

The basic update formula:

Q(st, at) += α[rt+1 + γ·max_a Q(st+1, a) − Q(st, at)]
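As a concrete illustration, the update above can be sketched as a one-line change to a table of Q-values. This is only a sketch: the state/action names and the values of alpha (learning rate) and gamma (discount factor) are illustrative assumptions, not part of the question.

```python
from collections import defaultdict

# Illustrative hyperparameters (assumed, not from the question).
alpha, gamma = 0.1, 0.9

# Tabular Q-function: Q[(state, action)] -> estimated value, defaults to 0.
Q = defaultdict(float)

def q_update(state, action, reward, next_state, next_actions):
    """Apply Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

# Example: one update after taking "a0" in "s0", receiving reward 1.0,
# and landing in "s1" where actions "a0" and "a1" are available.
q_update("s0", "a0", 1.0, "s1", ["a0", "a1"])
# Q[("s0", "a0")] is now 0.1 * (1.0 + 0.9 * 0.0 - 0.0) = 0.1
```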

I understand the formula, and what it does, but my question is:

How does the agent know to choose Q(st, at)?

I understand that an agent follows some policy π, but how do you create this policy in the first place?

  • My agents are playing checkers, so I am focusing on model-free algorithms.
  • All the agent knows is the current state it is in.
  • I understand that when it performs an action, you update the utility, but how does it know to take that action in the first place?

At the moment I have:

  • Check each move you could make from that state.
  • Pick whichever move has the highest utility.
  • Update the utility of the move made.

However, this doesn't really solve much; you still get stuck in local minima/maxima.

So, just to round things off, my main question is:

How, for an agent that knows nothing and is using a model-free algorithm, do you generate an initial policy, so it knows which action to take?


Solution

  • That update formula incrementally estimates the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once the values have already been learned. The most common policy to use *during* learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε, and a random action with probability ε. Because every action keeps a nonzero chance of being tried, the agent keeps exploring and does not get permanently stuck on an early, poorly estimated "best" move.
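An ε-greedy selection step can be sketched in a few lines. Here `Q` is assumed to be a dict mapping `(state, action)` pairs to values (as in tabular Q-learning); the function name and the default ε are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action (explore);
    otherwise pick the action with the highest estimated value (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)          # explore
    # Unseen (state, action) pairs default to 0.0; ties resolve to the
    # first maximal action in the list.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

A common refinement is to start with a large ε (mostly exploring) and decay it over episodes, so the agent gradually shifts from exploration to exploitation as its value estimates improve.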