python keras reinforcement-learning openai-gym q-learning

Keras Q-learning model performance doesn't improve when playing CartPole

I'm trying to train a deep Q-learning Keras model to play CartPole-v1. However, it doesn't seem to get any better. I don't believe it's a bug but rather my lack of knowledge on how to use Keras and OpenAI Gym properly. I am following this tutorial (https://adventuresinmachinelearning.com/reinforcement-learning-tutorial-python-keras/), which shows how to train a bot to play NChain-v0 (which I was able to follow), but now I am trying to apply what I learned to a more complex environment: CartPole-v1. Here is the code below:

###import libraries
import gym
import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam


###prepare environment
env = gym.make('CartPole-v1') #our environment is CartPole-v1


###make model
model = Sequential()
model.add(Dense(128, input_shape=(env.observation_space.shape[0],), activation='relu'))
model.add(Dense(64, activation='relu'))
model.add(Dense(env.action_space.n, activation='linear'))
model.compile(loss='mse', optimizer=Adam(), metrics=['mae'])


###train model
def train_model(n_episodes=500, epsilon=0.5, decay_factor=0.999, gamma=0.95):
    G_array = []
    for episode in range(n_episodes):
        observation = env.reset()
        observation = observation.reshape(-1, env.observation_space.shape[0])
        epsilon *= decay_factor
        G = 0
        done = False
        while done != True:
            if np.random.random() < epsilon:
                action = env.action_space.sample()
            else:
                action = np.argmax(model.predict(observation))
            new_observation, reward, done, info = env.step(action) #It keeps going left! Why though?
            new_observation = new_observation.reshape(-1, env.observation_space.shape[0])
            target = reward + gamma*np.max(model.predict(new_observation))
            target_vector = model.predict(observation)[0]
            target_vector[action] = target
            model.fit(observation, target_vector.reshape(-1, env.action_space.n), epochs=1, verbose=0)
            observation = new_observation
            G += reward
        G_array.append(G)

    return G_array

G_array = train_model()
print(G_array)

The output for the 'G_array' (the total reward for each game) is the following:

[14.0, 16.0, 18.0, 12.0, 16.0, 14.0, 17.0, 11.0, 11.0, 12.0, 11.0, 15.0, 13.0, 12.0, 12.0, 19.0, 13.0, 9.0, 10.0, 10.0, 11.0, 11.0, 14.0, 11.0, 10.0, 9.0, 10.0, 10.0, 12.0, 9.0, 15.0, 19.0, 11.0, 11.0, 10.0, 11.0, 13.0, 12.0, 13.0, 16.0, 12.0, 14.0, 9.0, 12.0, 20.0, 10.0, 12.0, 11.0, 9.0, 13.0, 13.0, 11.0, 13.0, 11.0, 24.0, 12.0, 11.0, 9.0, 9.0, 11.0, 10.0, 16.0, 10.0, 9.0, 9.0, 19.0, 10.0, 11.0, 13.0, 11.0, 11.0, 14.0, 23.0, 8.0, 13.0, 12.0, 15.0, 14.0, 11.0, 24.0, 9.0, 11.0, 11.0, 11.0, 10.0, 12.0, 11.0, 11.0, 10.0, 13.0, 18.0, 10.0, 17.0, 11.0, 13.0, 14.0, 12.0, 16.0, 13.0, 10.0, 10.0, 12.0, 22.0, 13.0, 11.0, 14.0, 10.0, 11.0, 11.0, 14.0, 14.0, 12.0, 18.0, 17.0, 9.0, 13.0, 12.0, 11.0, 11.0, 9.0, 16.0, 9.0, 18.0, 15.0, 12.0, 16.0, 13.0, 10.0, 13.0, 13.0, 17.0, 11.0, 11.0, 9.0, 9.0, 12.0, 9.0, 10.0, 9.0, 10.0, 18.0, 9.0, 11.0, 12.0, 10.0, 10.0, 10.0, 12.0, 12.0, 20.0, 13.0, 19.0, 9.0, 14.0, 14.0, 13.0, 19.0, 10.0, 18.0, 11.0, 11.0, 11.0, 8.0, 10.0, 14.0, 11.0, 16.0, 11.0, 13.0, 13.0, 9.0, 16.0, 11.0, 12.0, 13.0, 12.0, 11.0, 10.0, 11.0, 21.0, 12.0, 22.0, 12.0, 10.0, 13.0, 15.0, 19.0, 11.0, 10.0, 10.0, 11.0, 22.0, 11.0, 9.0, 26.0, 13.0, 11.0, 13.0, 13.0, 10.0, 10.0, 11.0, 12.0, 18.0, 9.0, 11.0, 13.0, 12.0, 13.0, 13.0, 12.0, 10.0, 11.0, 12.0, 12.0, 17.0, 11.0, 13.0, 13.0, 21.0, 12.0, 9.0, 14.0, 10.0, 15.0, 12.0, 12.0, 14.0, 11.0, 10.0, 14.0, 12.0, 12.0, 11.0, 8.0, 24.0, 9.0, 13.0, 10.0, 14.0, 10.0, 12.0, 13.0, 12.0, 13.0, 13.0, 14.0, 9.0, 17.0, 16.0, 9.0, 16.0, 14.0, 11.0, 9.0, 10.0, 15.0, 11.0, 9.0, 14.0, 12.0, 10.0, 13.0, 10.0, 10.0, 16.0, 15.0, 11.0, 8.0, 9.0, 9.0, 10.0, 9.0, 21.0, 13.0, 13.0, 10.0, 10.0, 11.0, 27.0, 13.0, 15.0, 11.0, 11.0, 12.0, 9.0, 10.0, 16.0, 10.0, 13.0, 13.0, 12.0, 12.0, 11.0, 17.0, 14.0, 9.0, 15.0, 26.0, 9.0, 9.0, 13.0, 9.0, 8.0, 12.0, 9.0, 10.0, 11.0, 9.0, 10.0, 9.0, 11.0, 9.0, 10.0, 12.0, 13.0, 13.0, 11.0, 11.0, 10.0, 15.0, 11.0, 11.0, 13.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 9.0, 15.0, 29.0, 11.0, 9.0, 18.0, 11.0, 13.0, 13.0, 16.0, 13.0, 15.0, 10.0, 11.0, 18.0, 9.0, 9.0, 11.0, 15.0, 11.0, 11.0, 10.0, 25.0, 10.0, 9.0, 11.0, 15.0, 15.0, 11.0, 11.0, 11.0, 13.0, 9.0, 11.0, 9.0, 13.0, 12.0, 12.0, 14.0, 11.0, 14.0, 8.0, 10.0, 13.0, 10.0, 10.0, 10.0, 9.0, 13.0, 9.0, 12.0, 10.0, 11.0, 9.0, 11.0, 12.0, 20.0, 9.0, 10.0, 14.0, 9.0, 12.0, 13.0, 11.0, 11.0, 11.0, 10.0, 15.0, 14.0, 14.0, 12.0, 13.0, 12.0, 11.0, 10.0, 12.0, 12.0, 9.0, 11.0, 9.0, 11.0, 13.0, 10.0, 11.0, 11.0, 11.0, 12.0, 13.0, 13.0, 12.0, 8.0, 11.0, 13.0, 9.0, 12.0, 10.0, 10.0, 15.0, 12.0, 11.0, 10.0, 17.0, 10.0, 14.0, 9.0, 10.0, 10.0, 10.0, 12.0, 10.0, 10.0, 12.0, 10.0, 15.0, 10.0, 10.0, 9.0, 10.0, 10.0, 10.0, 19.0, 9.0, 10.0, 11.0, 10.0, 11.0, 11.0, 13.0, 10.0, 11.0, 12.0, 11.0, 12.0, 13.0, 11.0, 8.0, 12.0, 12.0, 14.0, 14.0, 11.0, 9.0, 11.0, 9.0, 12.0, 9.0, 8.0, 9.0, 12.0, 8.0, 10.0, 11.0, 13.0, 12.0, 12.0, 10.0, 11.0, 12.0, 10.0, 12.0, 13.0, 9.0, 9.0, 10.0, 15.0, 14.0, 16.0, 8.0, 19.0, 10.0]

This apparently means the model did not improve at all for all 500 episodes. Excuse me if I am a complete beginner at using Keras and OpenAI Gym (especially Keras). Any help is appreciated. Thank you.

UPDATE: Through some debugging, I’ve recently noticed that the model tends to go left, or choose action 0, most of the time. Does that mean I should make some if-statements to modify the reward system (e.g. increase the reward if the pole angle is less than 5 degrees)? In fact, I am doing that right now, but to no avail so far.

Solution

Reinforcement learning is very noisy and your batch size is 1 which makes it even noisier. You can try to use a memory buffer of past episodes/updates which you update. You could use something like deque() from collections for this buffer. Then you randomly sample from this memory buffer according to a given batch-size. I found this repo to be very helpful (it includes a replay/memory buffer and a RL agent as you need it) https://github.com/udacity/deep-reinforcement-learning/tree/master/dqn Nevertheless, RL takes a long time to converge, unlike conventional deep learning where the loss decreases very fast in the beginning, in RL the reward will not increase for a long time and then suddenly start increasing.