I'm trying to understand Q-Learning
The basic update formula:
Q(s_t, a_t) ← Q(s_t, a_t) + α[r_{t+1} + γ·max_a Q(s_{t+1}, a) − Q(s_t, a_t)]
I understand the formula, and what it does, but my question is:
How does the agent know to choose Q(s_t, a_t)?
I understand that an agent follows some policy π, but how do you create this policy in the first place?
At the moment I have:
However, this doesn't really solve much; you still get stuck in local minima/maxima.
So, just to round things off, my main question is:
For an agent that knows nothing and is using a model-free algorithm, how do you generate an initial policy, so it knows which action to take?
That update formula incrementally computes the expected value of each action in every state. A greedy policy always chooses the highest-valued action; this is the best policy once the values have already been learned. The most common policy to use during learning is the ε-greedy policy, which chooses the highest-valued action with probability 1−ε, and a random action with probability ε.
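To make that concrete, here's a minimal sketch of ε-greedy Q-learning in Python. The environment interface (env.reset(), env.step(action), env.actions) is an assumption for illustration, not any particular library's API:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Unseen (state, action) pairs default to 0, so the initial
    # "policy" is arbitrary and exploration comes from epsilon.
    Q = defaultdict(float)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(Q, state, env.actions, epsilon)
            next_state, reward, done = env.step(action)
            # Q-learning update: Q(s,a) += α[r + γ·max_a' Q(s',a') − Q(s,a)]
            # No bootstrap term at terminal states.
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```

Note that Q starts at all zeros, so early greedy choices are essentially arbitrary and the ε-random actions do the exploring; as the estimates improve, the agent increasingly exploits what it has learned. In practice, ε is often decayed over time so the policy becomes greedy as learning progresses.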