reinforcement-learning

Reinforcement learning with non-repeatable actions


I am very new to RL and wondering about its capabilities. In my understanding, RL uses a kind of neural network that takes in a state and outputs a value (or probability) for each action. The training process minimizes the difference between the predicted value and the value of the real rewards (I may be wrong here).
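For reference (my own sketch, not part of the original setup), here is roughly what that value-based reading looks like as a DQN-style temporal-difference update in PyTorch; the network, state size, number of actions, and hyperparameters are all placeholder assumptions:

import torch
import torch.nn as nn

# Hypothetical Q-network: maps a 4-dim state to one Q-value per action (5 actions assumed).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 5))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99  # discount factor

def td_update(state, action, reward, next_state, done):
    # Move Q(state, action) toward reward + gamma * max_a Q(next_state, a).
    q_pred = q_net(state)[action]                # predicted value of the action taken
    with torch.no_grad():
        q_next = q_net(next_state).max().item()  # bootstrapped value of the next state
    target = reward + gamma * q_next * (1.0 - float(done))
    loss = (q_pred - target) ** 2                # squared TD error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Dummy usage with made-up shapes: 4-dim states, action index 2.
td_update(torch.zeros(4), action=2, reward=1.0, next_state=torch.zeros(4), done=False)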

However, my problem is tricky. In the beginning, there is an action space [x1, x2, x3, x4,..,x5]; after each step, the chosen action cannot be repeated. In other words, the action space shrinks after each iteration. The 'game' is done when the action space is empty. The goal of this game is to get the highest accumulated reward.

I did some searching online but failed to find any useful information. Thanks so much!

Added: Sorry for the unclear problem. In a classical RL task such as the CartPole game, actions are repeatable, so the agent learns the reward of each action at each state, and the goal is to get a reward of '1' rather than '0'. In my question, the game will end anyway (the action space shrinks every iteration and the game is done when it is empty), so I want the agent to take the action with the highest reward first, then the second-highest, and so on.

So I believe this is some kind of optimization problem, but I don't know how to modify the classical RL architecture for it. Could anyone help me find some related sources or tutorials?

Added Again: Currently, my solution is to change how an action is picked. I feed in a list of previously taken actions and prevent those actions from being picked again. For example, when the action is picked by the neural network, I modify the network's output like this:

with torch.no_grad():
    action_from_nn = nn(state)
    action_from_nn[actions_already_taken] = 0  # zero out actions that were already taken
    action = torch.max(action_from_nn, 1)[1]

If the random value is smaller than epsilon, the action is chosen randomly instead:

action = random.sample([i for i in action_space if i not in actions_already_taken], 1)[0]

As can be seen, I only force the agent not to choose repeated actions; I don't really change the output of the neural network. I am wondering whether that is OK, or whether there is room for further improvement?
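Putting both branches together, the whole selection step looks roughly like this (a sketch; q_net, n_actions, and epsilon are just placeholder names, and state is assumed to be a single unbatched tensor):

import random
import torch

def select_action(q_net, state, actions_already_taken, n_actions, epsilon):
    # Epsilon-greedy selection that never repeats an action.
    remaining = [a for a in range(n_actions) if a not in actions_already_taken]
    if random.random() < epsilon:
        return random.choice(remaining)  # explore only among unused actions
    with torch.no_grad():
        q_values = q_net(state).squeeze()
        # Zeroing out taken actions only rules them out if Q-values stay non-negative.
        q_values[list(actions_already_taken)] = 0
        return int(q_values.argmax().item())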


Solution

  • Your action and observation spaces need to stay the same size throughout a training run. A way around this for your problem is to make any action that has already been performed no longer have an effect on the game or on the observation. The actions already performed can be stored in a one-hot encoded 1-d array and included in your observation, so your agent will learn not to pick actions it has already taken; a rough sketch of this idea follows below.
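As a rough illustration of that idea (not a complete environment; the class name and reward values are made up), a step function could look like this:

import numpy as np

class NonRepeatableActionsEnv:
    # Sketch: fixed-size action space where repeated actions become no-ops.
    # The observation includes a one-hot mask of actions already taken,
    # so the agent can learn to avoid them.

    def __init__(self, rewards=(5.0, 1.0, 3.0, 2.0, 4.0)):  # made-up per-action rewards
        self.rewards = np.asarray(rewards, dtype=np.float32)
        self.n_actions = len(self.rewards)
        self.reset()

    def reset(self):
        self.taken = np.zeros(self.n_actions, dtype=np.float32)  # one-hot mask of used actions
        return self._obs()

    def _obs(self):
        # Observation = (any task-specific state) + the mask of already-taken actions.
        return self.taken.copy()

    def step(self, action):
        if self.taken[action]:            # repeated action: no effect and no reward
            reward = 0.0
        else:
            reward = float(self.rewards[action])
            self.taken[action] = 1.0
        done = bool(self.taken.all())     # episode ends once every action has been used
        return self._obs(), reward, done, {}

This keeps the action and observation spaces the same size for the whole run, and the one-hot mask gives the agent the information it needs to learn that repeating an action yields no reward.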