Tags: python, keras, tensorflow2.0, reinforcement-learning, agent

How to configure Dueling Double DQN input_shape for samples with a shape of (169, 3) each?


TLDR

The input shape of each sample for my DoubleDuelingDQN is (169, 3). The output of that DDDQN should be of shape (3), for the 3 corresponding actions. Currently, when I call

next_qs_list = self.target_network(next_states).numpy()

...the output shape is (64, 169, 3) for batch_size=64. My assumption is that this output shape is wrong and should be (64, 3). My NN is currently configured as shown below (where its call() function returns the wrong shape). How would I need to build my network so that it returns the correct shape (3) instead of (169, 3)?

class DuelingDeepQNetwork(keras.Model):
    def __init__(self, n_actions, neurons_1, neurons_2, neurons_3=None):
        super(DuelingDeepQNetwork, self).__init__()
        self.dens_1 = keras.layers.Dense(neurons_1, activation='relu', input_dim=(169,31,))  # Here I added input_dim which is not present in my LunarLander Agent
        self.dens_2 = keras.layers.Dense(neurons_2, activation='relu')
        if neurons_3:
            self.dens_3 = keras.layers.Dense(neurons_3, activation='relu')
        else:
            self.dens_3 = None  # avoid an AttributeError in call() when no third layer is configured
        self.V = keras.layers.Dense(1, activation=None)  # Value layer
        self.A = keras.layers.Dense(n_actions, activation=None)  # Advantage layer

    def call(self, state):
        x = self.dens_1(state)
        x = self.dens_2(x)
        if self.dens_3:
            x = self.dens_3(x)
        V = self.V(x)
        A = self.A(x)
        Q = V + (A - tf.math.reduce_mean(A, axis=1, keepdims=True))
        return Q

    def advantage(self, state):
        x = self.dens_1(state)
        x = self.dens_2(x)
        if self.dens_3:
            x = self.dens_3(x)
        A = self.A(x)
        return A

Updated error message:

ValueError: non-broadcastable output operand with shape (3,) doesn't match the broadcast shape (3,3)

is raised on the last line of:

for idx, done in enumerate(dones):
    target_qs_list[idx, actions[idx]] = rewards[idx]
    tmp1 = self.gamma * next_qs_list[idx, max_actions[idx]]
    target_qs_list[idx, actions[idx]] += tmp1 * (1-int(dones[idx]))
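
For context on where those shapes come from, here is a shape-only sketch (not the agent code); it assumes the networks currently return Q-values of shape (64, 169, 3) instead of (64, 3):

import numpy as np

# Assumed shapes mirroring the post: Q outputs are (batch, 169, 3) instead of (batch, 3).
batch_size, window, n_actions = 64, 169, 3
target_qs_list = np.zeros((batch_size, window, n_actions))
next_qs_list = np.random.rand(batch_size, window, n_actions)
actions = np.zeros(batch_size, dtype=np.int64)
max_actions = next_qs_list.argmax(axis=1)  # shape (64, 3), one index per action column

idx = 0
lhs = target_qs_list[idx, actions[idx]]    # view of shape (3,)
rhs = next_qs_list[idx, max_actions[idx]]  # array of shape (3, 3)
print(lhs.shape, rhs.shape)                # (3,) (3, 3)
# target_qs_list[idx, actions[idx]] += rhs
# -> ValueError: non-broadcastable output operand with shape (3,)
#    doesn't match the broadcast shape (3,3)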

Initial Post:

I have (kind of) finished my custom RL environment respecting the OpenAI Gym concept. Basically, the environment is a time series of OHLCV crypto prices, and env.reset() returns a window of shape (169, 31): 169 time steps and 31 features. With each env.step(), the agent's observation window moves forward by one time step. I want to start with 3 possible actions (do nothing / buy / sell):

self.action_space = spaces.Discrete(3)
self.observation_space = spaces.Box(low=-np.inf, high=np.inf, shape=(self.HISTORY_LENGTH+1, self.df_sample.shape[1]), dtype=np.float32)
# (169, 31)

Now I am failing at migrating my existing DQN agent from LunarLander-v2 (created by following multiple tutorials on YouTube and Medium). I assume that my DQNetwork and/or MemoryBuffer are not formatted correctly. I begin by filling up my memory with 1,000 samples from random actions. Then training begins, and on the agent.learn() call the following error is raised, which I am unable to interpret and which is the reason for this question:

TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`) and scalar tf.int32/tf.int64 tensors are valid indices, got <tf.Tensor: shape=(3,), dtype=int64, numpy=array([144, 165,   2])>

To be exact, this update loop raises the error:

for idx, done in enumerate(dones):
    target_qs_list[idx, actions[idx]] = rewards[idx] + self.gamma * next_qs_list[idx, max_actions[idx]] * (1-int(dones[idx]))
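
To see where that shape-(3,) index tensor comes from, here is a shape-only sketch (not the actual training code); it assumes the target network currently returns (64, 169, 3) instead of (64, 3), so argmax over axis=1 runs over the 169 time steps rather than the 3 actions:

import tensorflow as tf

# Assumed shape mirroring the post: the target network outputs (64, 169, 3).
next_qs_list = tf.random.uniform((64, 169, 3))
max_actions = tf.math.argmax(next_qs_list, axis=1)  # shape (64, 3), values in [0, 169)

idx = 0
print(max_actions[idx])  # a length-3 int64 tensor, e.g. [144, 165, 2]
# next_qs_list[idx, max_actions[idx]]
# -> TypeError: Only integers, slices (`:`), ellipsis (`...`), tf.newaxis (`None`)
#    and scalar tf.int32/tf.int64 tensors are valid indices, ...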

Since my debugging skills and my knowledge of Python, Keras and TF have reached their limits here, I appreciate any help.

Here is the code of my Agent. If more code or information is needed, I will happily provide more info.

class ReplayBuffer():
    def __init__(self, max_mem_size, dims):
        # self.memory = max_mem_size
        self.state_memory = np.zeros((max_mem_size, *dims), dtype=np.float32)  # Here I added "*" to unpack the now 2D observation shape (169, 31)
        self.action_memory = np.zeros(max_mem_size, dtype=np.int32)
        self.reward_memory = np.zeros(max_mem_size, dtype=np.float32)
        self.new_state_memory = np.zeros((max_mem_size, *dims), dtype=np.float32)  # Here I added "*" to unpack the now 2D observation shape (169, 31)
        self.done_memory = np.zeros(max_mem_size, dtype=np.int32)
        self.max_mem_size = max_mem_size
        self.mem_counter = 0
        self.index = 0

    def store_transition(self, transition):
        '''
        :param transition: Tuple of transition data (state, action, reward, new_state, done)
        :return: Nothing
        '''
        self.state_memory[self.index] = transition[0]
        self.action_memory[self.index] = transition[1]
        self.reward_memory[self.index] = transition[2]
        self.new_state_memory[self.index] = transition[3]
        self.done_memory[self.index] = transition[4]
        self.mem_counter += 1
        if self.index < self.max_mem_size - 1:
            self.index += 1
        else:
            self.index = 0

    def get_sample_batch(self, batch_size, replace=False):
        '''
        :param batch_size: Number of samples for batch
        :param replace: Whether or not duplicate entries are allowed in the returned batch
        :return: Tuples of transition data (state, action, reward, new_state, done)
        '''
        max_size = min(self.mem_counter, self.max_mem_size)
        batch_ids = np.random.default_rng().choice(max_size, size=batch_size, replace=replace)
        states = self.state_memory[batch_ids]
        actions = self.action_memory[batch_ids]
        rewards = self.reward_memory[batch_ids]
        new_states = self.new_state_memory[batch_ids]
        dones = self.done_memory[batch_ids]
        return states, actions, rewards, new_states, dones
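
A quick usage sketch of this buffer under the post's (169, 31) observation shape (the numbers are only illustrative, and numpy is assumed to be imported as np, as in the rest of the code):

# Illustrative only: fill the buffer with dummy transitions, then sample a batch.
buffer = ReplayBuffer(max_mem_size=1_000, dims=(169, 31))
dummy_state = np.zeros((169, 31), dtype=np.float32)
for _ in range(200):
    buffer.store_transition((dummy_state, 1, 0.5, dummy_state, False))

states, actions, rewards, new_states, dones = buffer.get_sample_batch(batch_size=64)
print(states.shape, actions.shape, dones.shape)  # (64, 169, 31) (64,) (64,)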


class DuelingDeepQNAgent():
    def __init__(self, lr, gamma, env, batch_size=64, mem_size=1_000_000, update_target_every=50):

        self.n_actions = env.action_space.n
        self.input_dims = env.observation_space.shape  #env.observation_space.shape[0]
        self.action_space = [i for i in range(self.n_actions)]
        self.gamma = gamma
        self.epsilon = 1.0
        self.batch_size = batch_size
        self.memory = ReplayBuffer(max_mem_size=mem_size, dims=self.input_dims)
        self.update_target_every = update_target_every
        self.update_target_counter = 0
        self.learn_step_counter = 0

        # Main model - gets trained every single step()
        self.q_network = DuelingDeepQNetwork(n_actions=self.n_actions, neurons_1=256, neurons_2=256, neurons_3=128)
        self.target_network = DuelingDeepQNetwork(n_actions=self.n_actions, neurons_1=256, neurons_2=256, neurons_3=128)
        self.q_network.compile(optimizer=Adam(learning_rate=lr), loss='mse')
        self.target_network.compile(optimizer=Adam(learning_rate=lr), loss='mse')

    def store_transition(self, transition):
        self.memory.store_transition(transition)

    def choose_action(self, observation):
        if np.random.random() < self.epsilon:
            action = np.random.choice(self.action_space)
        else:
            state = np.array([observation])  # Add an extra (batch) dimension

            q_values = self.q_network.advantage(state)
            action = tf.math.argmax(q_values, axis=1).numpy()[0]
        return action

    def learn(self):
        if self.memory.mem_counter < self.batch_size:
            return
        if self.learn_step_counter % self.update_target_every == 0:  # sync the target network every update_target_every learn steps
            self.target_network.set_weights(self.q_network.get_weights())

        current_states, actions, rewards, next_states, dones = self.memory.get_sample_batch(self.batch_size)

        current_qs_list = self.q_network(current_states)
        next_qs_list = self.target_network(next_states)
        target_qs_list = current_qs_list.numpy()  # ??? From Tensor to Numpy?!
        max_actions = tf.math.argmax(self.q_network(next_states), axis=1)

        # According to Phil: improve on my solution here....
        for idx, done in enumerate(dones):
            target_qs_list[idx, actions[idx]] = rewards[idx] + self.gamma * next_qs_list[idx, max_actions[idx]] * (1-int(dones[idx]))
        self.q_network.fit(current_states, target_qs_list, batch_size=self.batch_size, verbose=0)
        self.learn_step_counter += 1
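
As an aside regarding the "improve on my solution here" comment: once the networks return Q-values of shape (batch_size, n_actions), the per-sample loop can be replaced by a vectorized target computation. The following is only a sketch of one common formulation under that assumption, not a drop-in fix for the current 3D output shapes:

# Sketch only: assumes current_qs/next_qs have shape (batch_size, n_actions)
# and that actions, rewards, dones are 1D arrays of length batch_size.
batch_idx = np.arange(self.batch_size)
current_qs = self.q_network(current_states).numpy()
next_qs = self.target_network(next_states).numpy()
max_actions = np.argmax(self.q_network(next_states).numpy(), axis=1)

target_qs = current_qs.copy()
target_qs[batch_idx, actions] = rewards + self.gamma * next_qs[batch_idx, max_actions] * (1 - dones)
self.q_network.fit(current_states, target_qs, batch_size=self.batch_size, verbose=0)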

    

Solution

  • You can't use Dense layers to reduce a 2D sample (i.e. a 3D input, including the batch dimension) to a vector: a Dense layer only acts on the last dimension, so a (batch, timesteps, features) input produces a (batch, timesteps, units) output. Dense layers expect 1D feature vectors per sample, provided as a 2D (batch, features) tensor.

    Either add a keras.layers.Flatten() layer to flatten each sample (e.g. to a (batch_size, 169*31) tensor), or use a convolutional network.

    Also, your self.A layer appears to output the Q-value for each action given a state. The advantage would be those action values minus the value estimate, which I think is what you're calculating in call().

    If the Flatten doesn't fix your output shape, I'd double-check the shapes of the tensors used to compute Q, and the shape of the returned Q. print(Q.shape) will be your friend for debugging!
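
    To make the first point concrete, here is a minimal sketch of how the model from the question could look with a Flatten layer in front of the Dense stack. It assumes the (169, 31) observations and 3 actions from the post, and is only an illustration of the shape fix, not a tuned architecture:

    import tensorflow as tf
    from tensorflow import keras

    class DuelingDeepQNetwork(keras.Model):
        def __init__(self, n_actions, neurons_1, neurons_2, neurons_3=None):
            super(DuelingDeepQNetwork, self).__init__()
            self.flatten = keras.layers.Flatten()  # (batch, 169, 31) -> (batch, 169*31)
            self.dens_1 = keras.layers.Dense(neurons_1, activation='relu')
            self.dens_2 = keras.layers.Dense(neurons_2, activation='relu')
            self.dens_3 = keras.layers.Dense(neurons_3, activation='relu') if neurons_3 else None
            self.V = keras.layers.Dense(1, activation=None)          # state-value head
            self.A = keras.layers.Dense(n_actions, activation=None)  # advantage head

        def call(self, state):
            x = self.flatten(state)
            x = self.dens_1(x)
            x = self.dens_2(x)
            if self.dens_3 is not None:
                x = self.dens_3(x)
            V = self.V(x)                                            # (batch, 1)
            A = self.A(x)                                            # (batch, n_actions)
            Q = V + (A - tf.math.reduce_mean(A, axis=1, keepdims=True))
            return Q                                                 # (batch, n_actions)

    With this change, self.target_network(next_states) should return a (64, 3) tensor for a batch of 64, tf.math.argmax(..., axis=1) yields one action index per sample, and both errors above should disappear.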