reinforcement-learning · q-learning · keras-rl

Questions About Deep Q-Learning


I have read several materials about deep Q-learning and I'm not sure I understand it completely. From what I learned, it seems that deep Q-learning computes Q-values faster, rather than storing them in a table, by using a neural network to perform a regression, calculating the loss and backpropagating the error to update the weights. Then, in a testing scenario, it takes a state and the NN returns a Q-value for each action possible in that state, and the action with the highest Q-value is chosen to be performed in that state.
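For concreteness, here is how I picture the test-time behaviour, as a minimal sketch with a single random linear layer standing in for a trained network (the state size of 4 and the 3 actions are made up):

```python
import numpy as np

# Made-up sizes: a 4-dimensional state and 3 possible actions.
STATE_DIM, N_ACTIONS = 4, 3

# Stand-in for a trained network: a single random linear layer.
W = np.random.rand(STATE_DIM, N_ACTIONS)
b = np.zeros(N_ACTIONS)

def q_values(state):
    """Forward pass: one Q-value per possible action for this state."""
    return state @ W + b

state = np.random.rand(STATE_DIM)          # the observed state
action = int(np.argmax(q_values(state)))   # choose the action with the highest Q-value
```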

My only question is how the weights are updated. According to this site the weights are updated as follows:

Δw = α · [ R + γ · max_a′ Q(s′, a′, w) − Q(s, a, w) ] · ∇_w Q(s, a, w)

I understand that the weights are initialized randomly, R is returned by the environment, and gamma and alpha are set manually, but I don't understand how Q(s', a, w) and Q(s, a, w) are initialized and calculated. Does this mean we should build a table of Q-values and update it as in Q-learning, or are they calculated automatically at each NN training epoch? What am I not understanding here? Can somebody explain this equation to me?


Solution

  • In Q-Learning, we are concerned with learning the Q(s, a) function, which is a mapping from a state to a value for every action. Say you have an arbitrary state space and an action space of 3 actions; each state then maps to three different values, one per action. In tabular Q-Learning, this is done with a physical table. Consider the following case (illustration: a grid-world game with a Q-table for each state):

    Here, we have a Q-table for each state in the game (upper left). And after each time step, the Q-value for that specific action is updated according to some reward signal, where future rewards are discounted by some factor between 0 and 1.
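    A minimal sketch of that tabular update, assuming a made-up environment with 16 states and 3 actions:

```python
import numpy as np

# Made-up sizes: 16 states (e.g. a 4x4 grid world) and 3 actions.
N_STATES, N_ACTIONS = 16, 3
Q = np.zeros((N_STATES, N_ACTIONS))   # the physical table

alpha = 0.1    # step size
gamma = 0.9    # discount factor, between 0 and 1

def tabular_update(s, a, r, s_next):
    """One Q-learning step: move Q[s, a] toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])    # reward plus discounted future estimate
    Q[s, a] += alpha * (target - Q[s, a])
```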

    In Deep Q-Learning, we disregard the use of tables and create a parametrized "table" such as a feed-forward net (illustration). Here, the weights form combinations of the input that should approximately match the values seen in the tabular case (this is still actively researched).
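    As a rough sketch of such a parametrized "table" (a small Keras model with made-up sizes): instead of looking up a row of the table, you do a forward pass.

```python
import numpy as np
from tensorflow import keras

# Made-up sizes: 4 state features, 3 actions.
STATE_DIM, N_ACTIONS = 4, 3

# The weights of this network play the role of the table entries:
# instead of looking up Q[s], we compute Q(s, ., w) with a forward pass.
q_net = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(STATE_DIM,)),
    keras.layers.Dense(N_ACTIONS, activation="linear"),
])

state = np.random.rand(1, STATE_DIM)
print(q_net.predict(state, verbose=0))   # one approximate Q-value per action
```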

    The equation you presented is the Q-learning update rule cast as a gradient update rule, where:

    • alpha is the step size
    • R is the reward
    • gamma is the discount factor

    You do a forward pass (inference) of the network to retrieve the value of the "discounted future state", and subtract from it the current estimate Q(s, a, w). If this is unclear, I recommend you look up bootstrapping, which is basically what is happening here (see the sketch below).
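    Putting it together, here is a rough sketch of one such gradient update with a Keras model (made-up sizes; a real agent, e.g. keras-rl's DQNAgent, typically also adds experience replay and a target network):

```python
import numpy as np
from tensorflow import keras

STATE_DIM, N_ACTIONS = 4, 3   # made-up sizes
alpha, gamma = 1e-3, 0.99     # step size and discount factor

q_net = keras.Sequential([
    keras.layers.Dense(32, activation="relu", input_shape=(STATE_DIM,)),
    keras.layers.Dense(N_ACTIONS, activation="linear"),
])
q_net.compile(optimizer=keras.optimizers.Adam(learning_rate=alpha), loss="mse")

def dqn_update(s, a, r, s_next, done):
    """One gradient step toward the bootstrapped target R + gamma * max_a' Q(s', a', w)."""
    s, s_next = s.reshape(1, -1), s_next.reshape(1, -1)
    # Inference on the next state gives the "discounted future" term.
    target = r if done else r + gamma * np.max(q_net.predict(s_next, verbose=0))
    # Start from the current predictions so only the chosen action's value is pushed.
    y = q_net.predict(s, verbose=0)
    y[0, a] = target
    # Minimizing the squared error (target - Q(s, a, w))^2 performs the weight update.
    q_net.train_on_batch(s, y)
```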