I'm currently trying to implement the REINFORCE policy gradient method (with a neural network) for a game. Obviously, certain actions are invalid in certain states (you can't fire the rocket launcher if you don't have one!).
I tried to mask the softmax outputs (action probabilities) so that it only samples from valid actions. This seems to work at first; however, after several iterations of training, the valid actions stop being chosen (all of their output nodes go to 0 for certain input combinations). Interestingly, a single node corresponding to an invalid action ends up at 1 (100% probability) in these cases.
This is causing a huge problem, since I then have to resort to choosing an action at random, which obviously performs poorly. Are there any other ways to deal with this problem?
P.S. I'm updating the network by setting the "label" so that the chosen action's node carries the discounted reward while the remaining actions are 0, then training with categorical_crossentropy in Keras.
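For reference, here is a minimal sketch of that update scheme; the layer sizes, names, and the `update` helper are placeholders for illustration, not my exact code:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

STATE_DIM, NUM_ACTIONS = 32, 8  # hypothetical sizes

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(STATE_DIM,)),
    layers.Dense(NUM_ACTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

def update(state, action, discounted_return):
    # "Label" is all zeros except the chosen action, which carries the
    # discounted return; categorical_crossentropy then scales the log-prob
    # gradient of that action by the return, as in REINFORCE.
    target = np.zeros((1, NUM_ACTIONS))
    target[0, action] = discounted_return
    model.train_on_batch(state.reshape(1, -1), target)
```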
I ended up using two different approaches, both of which apply an invalid-action mask.
The first applies the mask after obtaining the softmax values from the policy network, then renormalizes the probabilities of the remaining (valid) actions and samples from those.
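Roughly, that looks like the sketch below (the fallback for the degenerate case where all probability mass lands on invalid actions is my own safeguard, not part of the original method):

```python
import numpy as np

def sample_valid_action(probs, valid_mask):
    """probs: softmax output over all actions; valid_mask: 1 for valid, 0 for invalid."""
    masked = probs * valid_mask
    total = masked.sum()
    if total <= 0:
        # Degenerate case: the policy puts all mass on invalid actions,
        # so fall back to a uniform choice over the valid ones.
        masked = valid_mask / valid_mask.sum()
    else:
        masked = masked / total  # renormalize over valid actions
    return np.random.choice(len(probs), p=masked)
```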
The second applies the mask at the logit layer (before the softmax), which is simpler and seems to work better (although I didn't do any quantitative measurement to prove this).
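A sketch of the idea, assuming a standalone NumPy softmax (the function name is mine): invalid actions get a very large negative logit, so their probability is effectively zero and, when done inside the network graph, no gradient flows through them.

```python
import numpy as np

def masked_softmax(logits, valid_mask):
    # Push invalid logits to a large negative value before the softmax.
    masked_logits = np.where(valid_mask.astype(bool), logits, -1e9)
    z = masked_logits - masked_logits.max()  # subtract max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum()
```

The same trick can be done inside the model itself, e.g. by adding `(1 - mask) * -1e9` to the logits tensor right before the softmax layer.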