I'm trying to understand Q-Learning
The basic update formula:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(s_t, a_t)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
At the moment I have:
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
For an agent that knows nothing and is using a model-free algorithm, how do you generate an initial policy, so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once the values have already been learned. The most common policy to use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε, and a random action with probability ε.
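To make that concrete, here's a minimal sketch of ε-greedy Q-learning in Python. The environment interface (env.reset(), env.step(action), env.actions) is an assumption for illustration, not any particular library's API:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Unseen (state, action) pairs default to 0, so the initial
    # "policy" is arbitrary and exploration comes from epsilon.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, env.actions, epsilon)
            next_state, reward, done = env.step(action)
            # Q-learning update: Q(s,a) += α[r + γ·max_a' Q(s',a') − Q(s,a)]
            # No bootstrap term at terminal states.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Note that Q starts at all zeros, so early greedy choices are essentially arbitrary and the ε-random actions do the exploring; as the estimates improve, the agent increasingly exploits what it has learned. In practice, ε is often decayed over time so the policy becomes greedy as learning progresses.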