python, machine-learning, deep-learning, reinforcement-learning, dqn

How should I define the state for my gridworld-like environment?


The problem I actually want to solve is not this simple, but this is a kind of toy game to help me work up to the bigger problem.

So I have a 5x5 matrix with all values equal to 0:

structure = np.zeros(25).reshape(5, 5)

and the goal is for the agent to turn all the values into 1, so I have:

goal_structure = np.ones(25).reshape(5, 5)

I created a class Player with 5 actions: go left, right, up, down, or flip (turn a 0 into a 1, or a 1 into a 0). For the reward, if the agent changes a value from 0 to 1, it gets a +1 reward. If it turns a 1 into a 0, it gets a negative reward (I tried many values, from -1 to 0, or even -0.1). And if it just goes left, right, up or down, it gets a reward of 0.

Because I want to feed the state to my neural net, I reshaped the state as below:

reshaped_structure = np.reshape(structure, (1, 25))

and then I append the normalized position of the agent to the end of this array (because I suppose the agent should have a sense of where it is):

reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
state = reshaped_state

But I don't get any good results! It behaves as if it were random! I tried different reward functions and different optimization tricks, such as experience replay, a target net, Double DQN, and dueling, but none of them seem to work! I guess the problem is with defining the state. Can anyone maybe help me with defining a good state?

Thanks a lot!

PS: this is my step function (shown with the rest of my Player class):

class Player:

    def __init__(self):
        self.x = 0
        self.y = 0

        self.max_time_step = 50
        self.time_step = 0
        self.reward_list = []
        self.sum_reward_list = []
        self.sum_rewards = []

        self.gather_positions = []
        # self.dict = {}

        self.action_space = spaces.Discrete(5)
        self.observation_space = 27

    def get_done(self, time_step):
        # the episode ends after a fixed number of steps
        if time_step == self.max_time_step:
            done = True
        else:
            done = False

        return done

    def flip_pixel(self):
        # toggle the cell under the agent between 0 and 1
        if structure[self.x][self.y] == 1:
            structure[self.x][self.y] = 0.0
        elif structure[self.x][self.y] == 0:
            structure[self.x][self.y] = 1

    def step(self, action, time_step):

        reward = 0

        # move the agent, clamping it to the grid boundaries
        if action == right:
            if self.y < y_threshold:
                self.y = self.y + 1
            else:
                self.y = y_threshold

        if action == left:
            if self.y > y_min:
                self.y = self.y - 1
            else:
                self.y = y_min

        if action == up:
            if self.x > x_min:
                self.x = self.x - 1
            else:
                self.x = x_min

        if action == down:
            if self.x < x_threshold:
                self.x = self.x + 1
            else:
                self.x = x_threshold

        if action == flip:
            self.flip_pixel()
            # +1 for turning a 0 into a 1, small penalty for undoing a 1
            if structure[self.x][self.y] == 1:
                reward = 1
            else:
                reward = -0.1

        self.reward_list.append(reward)

        done = self.get_done(time_step)

        # state = flattened grid plus the normalized agent position
        reshaped_structure = np.reshape(structure, (1, 25))
        reshaped_state = np.append(reshaped_structure, (np.float64(self.x/4), np.float64(self.y/4)))
        state = reshaped_state

        return state, reward, done

    def reset(self):

        structure = np.zeros(25).reshape(5, 5)

        reset_reshaped_structure = np.reshape(structure, (1, 25))
        reset_reshaped_state = np.append(reset_reshaped_structure, (0, 0))
        state = reset_reshaped_state

        self.x = 0
        self.y = 0
        self.reward_list = []

        self.gather_positions = []
        # self.dict.clear()

        return state

Solution

  • I would encode the agent position as a matrix like this:

    0 0 0 0 0
    0 0 0 0 0
    0 0 1 0 0
    0 0 0 0 0
    0 0 0 0 0
    

    (where the agent is in the middle). Of course you have to flatten this too for the network. So your total state is 50 input values, 25 for the cell states, and 25 for the agent position.

    When you encode the position as two floats, the network has to do extra work to decode the exact values of those floats. If you use an explicit scheme like the one above, it is very clear to the network exactly where the agent is. This is a "one-hot" encoding of the position.
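
    For illustration, here is a minimal sketch of this encoding in numpy (the helper name encode_state is mine, not from your code):

        import numpy as np

        def encode_state(structure, x, y):
            # 25 values for the cell states
            cells = structure.flatten().astype(np.float32)
            # 25 values for a one-hot encoding of the agent position
            position = np.zeros((5, 5), dtype=np.float32)
            position[x, y] = 1.0
            return np.concatenate([cells, position.flatten()])  # shape (50,)

    Something like encode_state(structure, self.x, self.y) would then replace the np.append(...) line in your step and reset functions.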

    If you look at the Atari DQN papers, for example, the agent position is always explicitly encoded with a neuron for each possible position.

    Note also that a very good policy for your agent is to stand still and constantly flip the same cell: it makes 0.45 reward per step this way (+1 for flipping a 0 to a 1, -0.1 for flipping it back, averaged over the 2 steps). A perfect policy can only make 25, but this degenerate policy will make 22.5 reward over an episode and be very hard to unlearn. I would suggest giving the agent a -1 for unflipping a correct cell.
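
    Concretely, the flip branch of your step function could look something like this (just a sketch, reusing your names):

        if action == flip:
            self.flip_pixel()

            if structure[self.x][self.y] == 1:
                reward = 1    # turned a 0 into a 1
            else:
                reward = -1   # undid a correct cell, so penalize as much as the gain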


    You mention that the agent is not learning. Might I suggest that you try to simplify as much as possible? First suggestion: reduce the length of the episode to 2 or 3 steps, and reduce the size of the grid to a single cell. See whether the agent can learn to consistently set that cell to 1. At the same time, simplify your agent's brain as much as possible: reduce it to just a single output layer, a linear model with an activation. This should be very quick and easy to learn. If the agent does not learn this within 100 episodes, I suspect there is a bug in your RL implementation. If it works, you can start to expand the size of the grid and the size of the network.
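
    If it helps, here is a minimal sketch of such a stripped-down "brain", assuming PyTorch and the single-cell grid with the position encoding from above (so 2 inputs and 5 action values); the framework and the exact sizes are my assumptions, not something from your code:

        import torch
        import torch.nn as nn

        # single-cell grid: state = [cell value, agent-position one-hot] -> 2 inputs
        # output = one Q-value per action (left, right, up, down, flip) -> 5 outputs
        q_net = nn.Sequential(nn.Linear(2, 5))  # add a non-linearity later if needed
        optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

        state = torch.tensor([[0.0, 1.0]])  # example input: cell is 0, agent is on it
        q_values = q_net(state)             # shape (1, 5), one value per action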