Training a TensorFlow model on OpenAI CartPole

I am implementing my first deep reinforcement learning model using TensorFlow, and I chose the CartPole problem for it.

I have resorted to a deep neural network with six layers, which trains on a randomly generated dataset of games whose score is above a threshold. The problem is that the model is not converging, and the final score stays around 10 points on average.

As suggested in several posts, I applied regularization and dropout to reduce any overfitting that may occur, but still no luck. I also tried reducing the learning rate.

The accuracy also stays around 0.60 right after training on one batch, even though the loss decreases in every iteration, which makes me think the model is still memorizing despite these measures. This kind of model does work on simple deep learning tasks, though.

Here is my code:

import numpy as np
import tensorflow as tf
import gym
import os
import random

os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'
model_path = "C:/Users/sanka/codes/cart pole problem/tf_save3"
env = gym.make("CartPole-v0")

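def train_set():           # training set generation function
    # Reuse the cached training set if it was saved by an earlier run.
    if os.path.exists("final_trainx.npy") and os.path.exists("final_trainy.npy"):
        tx = np.load("final_trainx.npy")
        ty = np.load("final_trainy.npy")
        return tx, ty
    tx = []
    ty = []
    for _ in range(10000):
        env.reset()
        score = 0
        moves = []
        obs = []
        p = []
        for _ in range(500):
            action = np.random.randint(0, 2)
            observation, reward, done, info = env.step(action)
            if len(p) == 0:
                # First step of the game: only remember the observation.
                p = observation
            else:
                moves += [action]
                obs += [observation]
                p = observation
            score += reward
            if done:
                break
        if score > 50:
            # Keep only games above the threshold; one-hot encode the actions.
            for i in range(len(moves)):
                ac = moves[i]
                tx.append(obs[i])
                if ac == 1:
                    ty.append([0, 1])
                else:
                    ty.append([1, 0])
    return np.array(tx), np.array(ty)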

weights = {
    1: tf.Variable(tf.truncated_normal([4, 128]), dtype=tf.float32),
    2: tf.Variable(tf.truncated_normal([128, 256]), dtype=tf.float32),
    3: tf.Variable(tf.truncated_normal([256, 512]), dtype=tf.float32),
    4: tf.Variable(tf.truncated_normal([512, 256]), dtype=tf.float32),
    5: tf.Variable(tf.truncated_normal([256, 128]), dtype=tf.float32),
    6: tf.Variable(tf.truncated_normal([128, 2]), dtype=tf.float32)
}

biases = {
    1: tf.Variable(tf.truncated_normal([128]), dtype=tf.float32),
    2: tf.Variable(tf.truncated_normal([256]), dtype=tf.float32),
    3: tf.Variable(tf.truncated_normal([512]), dtype=tf.float32),
    4: tf.Variable(tf.truncated_normal([256]), dtype=tf.float32),
    5: tf.Variable(tf.truncated_normal([128]), dtype=tf.float32),
    6: tf.Variable(tf.truncated_normal([2]), dtype=tf.float32)
}

def neural_network(x):
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[1]), biases[1]))
    x = tf.nn.dropout(x, 0.8)
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[2]), biases[2]))
    x = tf.nn.dropout(x, 0.8)
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[3]), biases[3]))
    x = tf.nn.dropout(x, 0.8)
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[4]), biases[4]))
    x = tf.nn.dropout(x, 0.8)
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[5]), biases[5]))
    x = tf.nn.dropout(x, 0.8)
    x = tf.add(tf.matmul(x, weights[6]), biases[6])
    return x

def test_nn(x):
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[1]), biases[1]))
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[2]), biases[2]))
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[3]), biases[3]))
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[4]), biases[4]))
    x = tf.nn.relu(tf.add(tf.matmul(x, weights[5]), biases[5]))
    x = tf.nn.softmax(tf.add(tf.matmul(x, weights[6]), biases[6]))
    return x

# Placeholders for a batch of observations (4 values each) and one-hot actions.
x = tf.placeholder(tf.float32, [None, 4])
y = tf.placeholder(tf.float32, [None, 2])
train_x, train_y = train_set()

def train_nn():
    prediction = neural_network(x)
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=prediction, labels=y))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(loss)
    test_pred = test_nn(x)
    correct = tf.equal(tf.argmax(test_pred, 1), tf.argmax(y, 1))
    accuracy = tf.reduce_mean(tf.cast(correct, dtype=tf.float32))
    # Build the action-selection op once instead of on every game step.
    a = tf.argmax(test_pred, 1)
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        epoches = 5
        batch_size = 100
        for j in range(epoches):
            ep_loss = 0
            for i in range(0, len(train_x), batch_size):
                epoch_x = train_x[i:min(i + batch_size, len(train_x))]
                epoch_y = train_y[i:min(i + batch_size, len(train_y))]
                _, c = sess.run([optimizer, loss], feed_dict={x: epoch_x, y: epoch_y})
                ep_loss += c
                #print("Accuracy is {0}".format(sess.run(accuracy, feed_dict={x: epoch_x, y: epoch_y})))
            print("epoch {0} completed out of {1} with loss {2}".format(j, epoches, ep_loss))
            print("Accuracy is {0}".format(sess.run(accuracy, feed_dict={x: train_x, y: train_y})))

        scores = []
        choices = []
        for each_game in range(10):
            print("game ", each_game)
            score = 0
            game_memory = []
            prev_obs = []
            env.reset()
            for _ in range(500):
                if len(prev_obs) == 0:
                    # No observation yet, so start with a random action.
                    action = random.randrange(0, 2)
                else:
                    x1 = np.array([prev_obs]).reshape(-1, 4)
                    action = int(sess.run(a, feed_dict={x: x1})[0])
                choices.append(action)
                new_observation, reward, done, info = env.step(action)
                prev_obs = new_observation
                game_memory.append([new_observation, action])
                score += reward
                if done:
                    break
            scores.append(score)

        print('Average Score:', sum(scores) / len(scores))
        print('choice 1:{}  choice 0:{}'.format(choices.count(1) / len(choices), choices.count(0) / len(choices)))

train_nn()



  • So you first collect examples of random trials that did more or less well and then train your model on these examples?

    In a way, that's not actually reinforcement learning. You are assuming that the actions taken by the random agent are good, and you are learning to imitate them. So if you think about it, your model predicts the action of a random agent 60% of the time. Considering the actions are random and you're above 50%, you're actually doing well.

    You are only able to get above 50% because you keep only the random games that happen to score above 50 points, so they are a non-random subset of all games. If you raise the bar and only consider random games that score over 100 points or so, you should get a better result. That way you select games with many more good moves than bad ones.

    If you want to attack the problem in a more reinforcement-learning way, that is, learning while you play rather than from someone else's play, I suggest you take a look at Q-learning or policy learning.

    The main thing to keep in mind is that there is generally no single correct action to take. Different actions may lead to the same result. So rather than trying to predict which action is correct given a state, you should try to predict the expected outcome of each action given a state, and then choose the action with the best expected outcome.
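
To make the last point concrete, here is a minimal sketch of tabular Q-learning. It is not CartPole and not the asker's network: it assumes a hypothetical five-cell corridor environment, and the names step, choose_action, greedy_action and train_q_table are invented for this illustration, as are the hyperparameter values. The table q[state][action] holds the expected discounted outcome of each action, and the agent picks the action with the best estimate.

```python
import random

# Hypothetical toy environment (an assumption for illustration only):
# a corridor of 5 cells, start in cell 0, reward 1.0 on reaching cell 4.
# Actions: 0 = left, 1 = right.
N_STATES = 5
N_ACTIONS = 2
GOAL = N_STATES - 1


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def greedy_action(q, state):
    """Pick the action with the best expected outcome in this state."""
    return max(range(N_ACTIONS), key=lambda a: q[state][a])


def choose_action(q, state, epsilon, rng):
    """Epsilon-greedy choice; ties between equally good actions are broken randomly."""
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])


def train_q_table(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    """Learn q[state][action], an estimate of the expected discounted return."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap the episode length
            action = choose_action(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Move the estimate toward reward + discounted best future value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = train_q_table()
# Greedy policy for every non-terminal cell.
policy = [greedy_action(q_table, s) for s in range(GOAL)]
print(policy)
```

CartPole observations are continuous, so a table like this would not work there directly. The usual move is to replace the table with a function approximator such as a neural network that outputs one value per action, while keeping the same update target.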