Tags: keras, reinforcement-learning, openai-gym

Reinforcement learning - how to teach a neural network to avoid actions already chosen during the episode?


I built a custom OpenAI Gym environment in which I have 13 different actions and 33 observation items. During an episode every action can be used, but each one only once; otherwise the episode ends. Thus the maximum length of an episode is 13.
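For context, here is a minimal sketch of what such an environment could look like, using the classic gym API. The class name, the reward values, and the assumption that the first 13 observation entries flag the already-used actions are my own illustration, not taken from the actual environment:

    import gym
    import numpy as np
    from gym import spaces

    class OneShotActionsEnv(gym.Env):
        """Toy environment: 13 actions, each usable at most once per episode."""

        def __init__(self):
            self.action_space = spaces.Discrete(13)
            # hypothetical layout: 13 "already used" flags plus 20 other items
            self.observation_space = spaces.Box(0.0, 1.0, shape=(33,), dtype=np.float32)
            self.used = np.zeros(13, dtype=np.float32)

        def _obs(self):
            obs = np.zeros(33, dtype=np.float32)
            obs[:13] = self.used  # expose which actions were already taken
            return obs

        def reset(self):
            self.used = np.zeros(13, dtype=np.float32)
            return self._obs()

        def step(self, action):
            if self.used[action] == 1.0:
                # repeating an action ends the episode immediately
                return self._obs(), 0.0, True, {}
            self.used[action] = 1.0
            done = bool(self.used.sum() == 13)  # every action used once
            return self._obs(), 1.0, done, {}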

I tried to train several neural networks for this, but so far the NN has not learned it well, and episodes end well before the 13th step. The last layer of the NN is a softmax layer with 13 neurons.

Do you have any idea what an NN that could learn to choose from the 13 actions one by one might look like?

Kind regards, Ferenc


Solution

  • In the end I wrote this code:

    from keras import backend as K
    import tensorflow as tf

    def mask_output2(x):
        inp, soft_out = x
        # add a very small value in order to avoid having 0 everywhere;
        # a scalar constant broadcasts, so no batch size needs to be hardcoded
        y = soft_out + K.constant(1e-7, dtype='float32')

        # zero out the entries where inp is 1 (action already used)
        y = K.switch(K.equal(inp, 0), y, K.zeros_like(y))
        y_sum = K.sum(y, axis=-1)

        # avoid division by zero when every action has been masked
        y_sum_corrected = K.switch(K.equal(y_sum, 0), K.ones_like(y_sum), y_sum)
        y_sum_corrected = tf.divide(1.0, y_sum_corrected)

        # renormalize each row so the remaining probabilities sum to 1
        y = tf.einsum('ij,i->ij', y, y_sum_corrected)
        return y

    

    It simply corrects the softmax output by clearing (setting to 0) those neurons where the inp tensor is set to 1 (indicating an action already used), and then renormalizes the remaining probabilities so they sum to 1.
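    As a usage illustration, the function can be attached as a Lambda layer on top of the softmax output. This is a minimal wiring sketch; the hidden-layer size and the input names (obs_in, used_in) are assumptions, and used_in must carry 1 for each action already taken:

        from keras.layers import Input, Dense, Lambda
        from keras.models import Model

        obs_in = Input(shape=(33,))   # the 33 observation items
        used_in = Input(shape=(13,))  # 1 where an action was already used, else 0

        h = Dense(64, activation='relu')(obs_in)
        soft = Dense(13, activation='softmax')(h)

        # mask the already-used actions and renormalize the policy
        masked = Lambda(mask_output2)([used_in, soft])

        model = Model(inputs=[obs_in, used_in], outputs=masked)

    Because the masked output is still a valid probability distribution, it can be sampled from directly when picking the next action.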