I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start in any place in the room and get to the goal state(this might be, to the tile that has a door, for example). Currently when it wants to move to another tile, it will go to that tile, but I was thinking that in the future I might add a random chance of going to another tile, instead of that. It can only move up, down, left and right. Reaching the goal state yields +100 and the rest of the actions will yield 0.
I am using the algorithm found here, which can be seen in the image bellow.
Now, regarding this, I have some questions:
I've heard also about TD(Temporal Differences), which seems to be represented by the following expression:
Q(a, s) = Q(a, s) * alpha * [R(a, s) + gamma * Max { Q(a', s' } - Q(a, s)]
which for alpha = 1, just seems the one shown first in the picture. What difference does that gamma make, here?
I'm not an expert on the topic, but I'll take a crack at responding directly at your many questions
[BTW, I should get multi +reps for each question!... Just kidding, if I was in "for the SO reps", I'd stay clear from posting which will get a grand total of 20 views with half of these visitors having an rough idea of the concepts at hand]
1) Q-Learning a two-phase thing?
Yes, Q-Learning implies two phases, a learning phase and an action phase. As with many automated learning algorithms it is possible to "keep on learning" while in the action phase.
2) Infinite number of steps for an optimal G matrix? Not sure where the statement requiring an infinite number of learning cycles to learn an optimal Q matrix. To be sure (and unless the alpha and gamma factors are incorrect), the algorithm converges, if only at a possibly very slow rate. This prompts me to skip and comment on your idea of a 300x200 game space, and well... YES!, for such a space, an given the reward model, it will take what seems to infinity to get an "optimal" Q table. Now, it may be possible that mathematically the algorithm never reaches the optimal nivarna, but for practical solutions, working on the asymptote is just good enough.
3) Role of gamma in TD model
This indicates the importance of deferring rewards, on a path (here with your model, literally), towards higher rewards. This generally prevents the algorithm of getting stuck in local maximas of the solution space, at the cost of making learning even slower...
4) Suggestions to help with learning a big maze
At the risk of betraying the nature of Q-Learning, you can start the robot at increasingly further distances from the goal. This will help it improve the Q Matrix in the area of the states which surround the goal first, then leveraging this partially learned Q matrix as the initial state taken, randomly, within an increasing radius from the goal.
Another, riskier, approach (and indeed one that may further belie the true nature of Q-Learning), would be to change the R Matrix to provide increasingly high rewards, at random placed located at a decreasing distance from the goal. The downside to this approach is that it may introduce opportunities of many local maximas in the solution space, where the algorithm may get stuck if the learning rate and other factors are not tweaked properly.
Both of these approaches in particular the latter can be interpreted as a your (the designer) "wiring" in a solution. Other will say that this is merely as way of introducing a dash of DP into the mix...
5) Neural Net (NN) 6) Genetic Algorithm (GA)
No opinion about adding NN or GA into the mix.
I probably made enough of a fool of myself with some of the less-than-mathematically-accurate statement above. ;-)