python · tensorflow · reinforcement-learning

How to deal with varying state space sizes in reinforcement learning?


I'm working on A2C reinforcement learning where the number of agents in my environment increases and decreases over time. As the number of agents changes, the size of the state space changes as well. I have tried to handle the changing state space in two ways:

  • If the state space exceeds the maximum size selected as n_input, the excess entries are subsampled with np.random.choice, which draws random samples from the state after its values are converted into probabilities.

  • If the state space is smaller than the maximum size, I pad the state with zeros.

    import numpy as np

    def get_state_new(state):
        """Map a variable-length state to the fixed length n_input (a global)."""
        state = np.asarray(state, dtype=np.float64)
        if len(state) > n_input:
            # Softmax over the state values gives the sampling probabilities.
            p = np.exp(state - state.max())  # subtract max for numerical stability
            p = p / p.sum()
            # Subsample n_input entries according to those probabilities.
            state_new = np.random.choice(state, size=n_input, p=p)
        else:
            # Pad the state with zeros up to the fixed input size.
            state_new = np.zeros(n_input)
            state_new[:len(state)] = state
        return state_new
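
For reference, here is a quick sketch that exercises both branches (n_input = 5 is a hypothetical value chosen for illustration):

    n_input = 5
    short_state = np.array([0.1, 0.2, 0.3])
    long_state = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7])
    print(get_state_new(short_state))  # zero-padded to length 5
    print(get_state_new(long_state))   # subsampled down to length 5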
    

It works, but the results are not as expected, and I don't know whether this approach is correct.

My question

Are there any reference papers that deal with such a problem, and how should one handle a changing state space?


Solution

  • I tried several solutions to this problem, and I found that encoding the state is the best one for my case:

    • Select a model with a pre-estimated maximum state space; if the state space is smaller than this maximum, pad the state with zeros.
    • Consider only the agent's own state, without sharing the states of the other agents.
    • As mentioned in paper [1], extra connected autonomous vehicles (CAVs) are not included in the state, and if there are fewer than the maximum number of CAVs, the state is padded with zeros. We can choose how many agents' states to share by appending them to the agent's own state (a minimal sketch follows this list).
    • Encode the state, which helps us process a variable-length input and compress the information into a fixed length. In the encoder, every cell in an LSTM layer returns a hidden state (h_t) and a cell state (c_t); an RNN with gated recurrent units (GRU) returns only a hidden state.
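
    Here is a minimal sketch of that zero-padded state sharing, where MAX_SHARED and STATE_DIM are hypothetical values chosen for illustration:

    import numpy as np

    MAX_SHARED = 3  # assumed maximum number of neighbouring agents to share
    STATE_DIM = 4   # assumed per-agent state dimension

    def build_observation(own_state, neighbour_states):
        # Fixed-size observation: own state followed by up to MAX_SHARED
        # neighbour states; missing slots stay zero (the padding).
        obs = np.zeros((1 + MAX_SHARED) * STATE_DIM)
        obs[:STATE_DIM] = own_state
        for i, s in enumerate(neighbour_states[:MAX_SHARED]):
            obs[(i + 1) * STATE_DIM:(i + 2) * STATE_DIM] = s
        return obs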


    For the encoder, I use the encoder from the "Neural machine translation with attention" tutorial code:

    import tensorflow as tf

    class Encoder(tf.keras.Model):
      def __init__(self, vocab_size, embedding_dim, enc_units, batch_sz):
        super(Encoder, self).__init__()
        self.batch_sz = batch_sz
        self.enc_units = enc_units
        # Embed the discrete inputs into dense vectors.
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        # The GRU returns the full output sequence and the final hidden state.
        self.gru = tf.keras.layers.GRU(self.enc_units,
                                       return_sequences=True,
                                       return_state=True,
                                       recurrent_initializer='glorot_uniform')

      def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

      def initialize_hidden_state(self):
        # Zero initial hidden state, one row per batch element.
        return tf.zeros((self.batch_sz, self.enc_units))
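
    A minimal usage sketch with made-up sizes (the vocabulary size, sequence length, and unit counts here are assumptions; in practice they depend on how the state is discretized):

    encoder = Encoder(vocab_size=1000, embedding_dim=64, enc_units=128, batch_sz=16)
    hidden = encoder.initialize_hidden_state()
    example_batch = tf.random.uniform((16, 10), maxval=1000, dtype=tf.int32)
    output, state = encoder(example_batch, hidden)
    # output: (16, 10, 128) per-step features; state: (16, 128) fixed-length encoding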
    
    • LSTM zero padding and masking, where we pad the state with a special value that is masked (skipped) later. If we pad without masking, the padded values are treated as actual values and thus become noise in the state [2-4]. A sketch is shown below.
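
    Here is a minimal sketch of zero padding plus masking in Keras (MAX_AGENTS, STATE_DIM, and N_ACTIONS are hypothetical values):

    import tensorflow as tf

    MAX_AGENTS, STATE_DIM, N_ACTIONS = 6, 4, 2  # assumed sizes

    model = tf.keras.Sequential([
        # Masking skips timesteps whose features all equal mask_value,
        # so zero-padded agent slots do not contribute to the output.
        tf.keras.layers.Masking(mask_value=0.0, input_shape=(MAX_AGENTS, STATE_DIM)),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(N_ACTIONS),
    ])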

    1- Vinitsky, E., Kreidieh, A., Le Flem, L., Kheterpal, N., Jang, K., Wu, C., ... & Bayen, A. M. (2018, October). Benchmarks for reinforcement learning in mixed-autonomy traffic. In Conference on Robot Learning (pp. 399-409).

    2- Kochkina, E., Liakata, M., & Augenstein, I. (2017). Turing at SemEval-2017 Task 8: Sequential approach to rumour stance classification with branch-LSTM. arXiv preprint arXiv:1704.07221.

    3- Ma, L., & Liang, L. (2020). Enhance CNN robustness against noises for classification of 12-lead ECG with variable length. arXiv preprint arXiv:2008.03609.

    4- How to feed LSTM with different input array sizes?

    5- Zhao, X., Xia, L., Zhang, L., Ding, Z., Yin, D., & Tang, J. (2018, September). Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems (pp. 95-103).