Keras reinforcement training with softmax


A project I am working on has a reinforcement learning stage that uses the REINFORCE algorithm. The model ends in a softmax activation layer, and because of that a negative learning rate is used as a substitute for negative rewards. I have some doubts about this approach and can't find much literature on using a negative learning rate.

Does reinforcement learning work when the learning rate is switched between positive and negative? If not, what would be a better approach: getting rid of the softmax, or does Keras have a nice option for this?

Loss function:

from keras import backend as K

def log_loss(y_true, y_pred):
    '''
    Keras 'loss' function for the REINFORCE algorithm,
    where y_true is the action that was taken, and updates
    with the negative gradient will make that action more likely.
    We use the negative gradient because Keras expects training data
    to minimize a loss function.
    '''
    # Clip predictions away from 0 and 1 so the log stays finite.
    return -y_true * K.log(K.clip(y_pred, K.epsilon(), 1.0 - K.epsilon()))
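To see what this loss evaluates to on a single sample, here is a minimal NumPy mirror of the same expression. The function name `log_loss_np`, the constant `EPS` (set to Keras's default epsilon of 1e-7) and the example vectors are assumptions for illustration only; they are not part of the project.

```python
import numpy as np

EPS = 1e-7  # assumed stand-in for K.epsilon() (the Keras default)

def log_loss_np(y_true, y_pred):
    """NumPy mirror of log_loss: elementwise -y_true * log(clipped y_pred)."""
    return -y_true * np.log(np.clip(y_pred, EPS, 1.0 - EPS))

# One-hot vector for the action that was taken (action 1 of 3).
y_true = np.array([0.0, 1.0, 0.0])
# Hypothetical softmax output of the policy network.
y_pred = np.array([0.2, 0.5, 0.3])

per_action = log_loss_np(y_true, y_pred)
total = per_action.sum()  # equals -log(0.5) for this example
```

Only the entry of the action that was taken contributes, so minimizing this value pushes the predicted probability of that action towards one.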

Switching learning rate:

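# Use +lr for won games and -lr for lost games (optimizer, lr, won,
# learner_net, st_tensor and mv_tensor come from the training loop).
K.set_value(optimizer.lr, lr * (+1 if won else -1))
learner_net.train_on_batch(np.concatenate(st_tensor, axis=0),
                           np.concatenate(mv_tensor, axis=0))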
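As a rough illustration of what the sign flip does mechanically, the sketch below applies one plain gradient-descent step directly to the logits of a toy three-action softmax, once with a positive and once with a negative learning rate. It uses NumPy rather than Keras, and the helper names (`softmax`, `sgd_step`) and toy values are assumptions, not part of the project. It relies on the standard result that the gradient of -log softmax(z)[a] with respect to z is softmax(z) - onehot(a).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

def sgd_step(logits, action, lr):
    """One plain SGD step on the loss -log softmax(logits)[action]."""
    grad = softmax(logits)
    grad[action] -= 1.0  # gradient w.r.t. logits: softmax - onehot(action)
    return logits - lr * grad

logits = np.zeros(3)  # toy policy: all three actions equally likely
action = 1            # the action that was played

p_before = softmax(logits)[action]
p_won = softmax(sgd_step(logits, action, lr=+0.1))[action]   # positive lr
p_lost = softmax(sgd_step(logits, action, lr=-0.1))[action]  # negative lr
```

In this toy setting the positive step raises the probability of the chosen action and the negative step lowers it. It says nothing about how repeated negative updates behave over a full training run, or how optimizers other than plain SGD react to a negative learning rate.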

Update, test results

I ran a test with only positive reinforcement samples, omitting all negative examples and thus the negative learning rate. The win rate is rising, so the model is improving, and I can safely assume that using a negative learning rate is not correct.
Does anybody have thoughts on how we should implement it?

Update, model explanation

We are trying to recreate AlphaGo as described by DeepMind. This is their description of the slow policy net:

For the first stage of the training pipeline, we build on prior work on predicting expert moves in the game of Go using supervised learning [13, 21–24]. The SL policy network pσ(a|s) alternates between convolutional layers with weights σ, and rectifier nonlinearities. A final softmax layer outputs a probability distribution over all legal moves a.


Solution

  • Not sure if it is the best way, but at least I found a way that works.

    For all negative training samples I reuse the network prediction, set the action I want to unlearn to zero, and adjust all values so that they sum to one again.

    I tried several ways to adjust them afterwards but haven't run enough tests to be sure which works best:

    • apply softmax (the action that has to be unlearned gets a nonzero value again)
    • redistribute the old action value over all other actions
    • set all illegal action values to zero and distribute the total removed value
    • distribute the value in proportion to the values of the other actions

    There are probably several other ways to do this, and which one works best might depend on the use case. There might also be a better approach, but this one works at least.
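    Two of the adjustments listed above can be sketched in NumPy as follows. This is only a sketch under stated assumptions: the function name `unlearn_target` and its `mode` argument are hypothetical, the "uniform" mode assumes the removed value is split in equal shares, the prediction is assumed to sum to one, and the case where the unlearned action holds all of the probability mass is not handled.

```python
import numpy as np

def unlearn_target(pred, action, mode="proportional"):
    """Copy a predicted distribution, zero out `action`, and renormalize."""
    target = np.asarray(pred, dtype=float).copy()
    removed = target[action]
    target[action] = 0.0
    others = np.arange(target.size) != action  # mask of all other actions
    if mode == "proportional":
        # Scale the remaining values so they keep their ratios and sum to one.
        target[others] /= target[others].sum()
    elif mode == "uniform":
        # Assumed variant: give every other action an equal share.
        target[others] += removed / others.sum()
    else:
        raise ValueError("unknown mode: %r" % mode)
    return target

pred = np.array([0.1, 0.6, 0.3])  # hypothetical network prediction
t_prop = unlearn_target(pred, action=1, mode="proportional")
t_unif = unlearn_target(pred, action=1, mode="uniform")
```

    For this example the proportional variant gives [0.25, 0, 0.75] and the uniform variant gives [0.4, 0, 0.6]. Either array would then serve as the training target for the negative sample in place of the one-hot move.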