python, deep-learning, reinforcement-learning, q-learning, dqn

DQN: understanding the input and output layers


I have a question about the input and output (layer) of a DQN.

For example:

Two points: P1(x1, y1) and P2(x2, y2)

P1 has to walk towards P2

I have the following information:

  • Current position P1 (x/y)
  • Current position P2 (x/y)
  • Distance from P1 to P2 (x/y)
  • Direction from P1 to P2 (x/y)

P1 has 4 possible actions:

  • Up
  • Down
  • Left
  • Right

How do I have to set up the input and output layers?

  • 4 input nodes
  • 4 output nodes

Is that correct? What do I have to do with the output? I get 4 arrays with 4 values each as output. Is applying argmax to the output correct?

Edit:

Input / State:

import math
import numpy as np

# Current position P1
state_pos = [x_POS, y_POS]
state_pos = np.asarray(state_pos, dtype=np.float32)
# Current position P2
state_wp = [wp_x, wp_y]
state_wp = np.asarray(state_wp, dtype=np.float32)
# Distance P1 - P2
state_dist_wp = [wp_x - x_POS, wp_y - y_POS]
state_dist_wp = np.asarray(state_dist_wp, dtype=np.float32)
# Direction P1 - P2 (unit vector; assumes P1 and P2 are not the same point)
distance = [wp_x - x_POS, wp_y - y_POS]
norm = math.sqrt(distance[0] ** 2 + distance[1] ** 2)
state_direction_wp = [distance[0] / norm, distance[1] / norm]
state_direction_wp = np.asarray(state_direction_wp, dtype=np.float32)
state = [state_pos, state_wp, state_dist_wp, state_direction_wp]
state = np.array(state)  # shape (4, 2)
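
Stacking these four 2-element arrays gives state a shape of (4, 2). If you would rather build a flat 8-element state vector directly (which is what the solution below achieves with a Reshape layer), a minimal sketch reusing the arrays above:

# flat alternative, shape (8,): [x1, y1, x2, y2, dx, dy, dir_x, dir_y]
flat_state = np.concatenate([state_pos, state_wp, state_dist_wp, state_direction_wp])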

Network:

import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def __init__(self):
    self.q_net = self._build_dqn_model()
    self.epsilon = 1

def _build_dqn_model(self):
    q_net = Sequential()
    q_net.add(Dense(4, input_shape=(4,2), activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net

def random_policy(self, state):
    # uniformly random action in {0, 1, 2, 3}
    return np.random.randint(0, 4)

def collect_policy(self, state):
    # epsilon-greedy: explore with probability epsilon, otherwise act greedily
    if np.random.random() < self.epsilon:
        return self.random_policy(state)
    return self.policy(state)

def policy(self, state):
    # Here I get 4 arrays with 4 values each as output
    action_q = self.q_net(state)

Solution

  • Passing input_shape=(4,2) to the first Dense layer is what causes the output shape to be (None, 4, 4): a Dense layer acts only on the last axis of its input, so a (None, 4, 2) input yields a (None, 4, units) output. Defining q_net the following way solves it:

    from tensorflow.keras.layers import Reshape  # additional import needed

    q_net = Sequential()
    q_net.add(Reshape(target_shape=(8,), input_shape=(4,2)))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(128, activation='relu', kernel_initializer='he_uniform'))
    q_net.add(Dense(4, activation='linear', kernel_initializer='he_uniform'))
    rms = tf.optimizers.RMSprop(learning_rate=1e-4)
    q_net.compile(optimizer=rms, loss='mse')
    return q_net
    

    Here, q_net.add(Reshape(target_shape=(8,), input_shape=(4,2))) reshapes the (None, 4, 2) input to (None, 8), where None represents the batch dimension.
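
    As a side note, Keras's Flatten layer is an equivalent alternative to the Reshape layer here, since it collapses every dimension except the batch dimension; a minimal sketch of the replacement line:

    from tensorflow.keras.layers import Flatten

    # Flatten also turns the (None, 4, 2) input into (None, 8)
    q_net.add(Flatten(input_shape=(4,2)))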

    To verify, print q_net.output_shape; it should be (None, 4), whereas in the previous case it was (None, 4, 4).
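
    For example:

    print(q_net.output_shape)  # (None, 4) with the Reshape fix; (None, 4, 4) without it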

    You also need to do one more thing. Recall that input_shape does not take the batch dimension into account: input_shape=(4,2) expects inputs of shape (batch_shape, 4, 2). Verify this by printing q_net.input_shape; it should output (None, 4, 2). So you have to add a batch dimension to your input, which you can do as follows:

    state_with_batch_dim = np.expand_dims(state, 0)  # shape (1, 4, 2)
    

    Then pass state_with_batch_dim to q_net as input. For example, you can call the policy method you wrote as policy(np.expand_dims(state, 0)) and get an output of shape (batch_shape, 4), in this case (1, 4).
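
    To make this concrete, here is a minimal sketch of how the policy method could be completed under this fix (the mapping of indices 0-3 to Up/Down/Left/Right is an assumption based on the order listed in the question):

    def policy(self, state):
        # state must already include the batch dimension: shape (1, 4, 2)
        action_q = self.q_net(state)  # Q-values, shape (1, 4)
        # argmax selects the index of the action with the highest Q-value
        return int(np.argmax(action_q.numpy()[0]))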

    And here are the answers to your initial questions:

    1. Your output layer should have 4 nodes (units).
    2. Your first Dense layer does not have to have 4 nodes (units). For the Reshape layer, the notion of nodes or units does not apply; you can think of it as a placeholder that takes a tensor of shape (None, 4, 2) and outputs a reshaped tensor of shape (None, 8).
    3. Now you should get outputs of shape (None, 4), where the 4 values are the Q-values of the 4 corresponding actions. The output already contains the Q-values themselves; argmax is only needed to pick the index of the action with the highest Q-value.