Tags: reinforcement-learning, openai-gym, python-3.10, simpy, reward

Why is the mean reward per episode of my PPO and DQN decreasing over time?


I am training an RL agent to optimise dispatching in a job shop manufacturing system. My approach is based on this code: https://github.com/AndreasKuhnle/SimRLFab. My version migrates the environment to a gymnasium environment and updates the Python version from 3.6 to 3.10. I am testing different algorithms such as PPO, TRPO and DQN. During training I noticed that the mean reward per episode, the ep_rew_mean in my tensorboard, decreases over time, contrary to my expectation that it should increase. The reward function is based on the utilization rate of the machines and should be maximised. What could be the reason for this behaviour?

I am using a "self-made" gym environment and a simpy environment. I do not consider myself an expert, but it looks to me as if the agent learns to minimize the reward, although it should not. Am I right with this thought? As far as I understand, the utilization should be maximised, which is why the reward is positive and calculated as r_util = exp(util/1.5) - 1.
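As a quick sanity check on the shape of this reward, the sketch below evaluates r_util over the range of possible utilizations. It assumes util is a mean machine utilization in [0, 1], which is how I read the reward code; the helper name reward_from_utilization is only for illustration and is not part of my environment.

```python
import math


def reward_from_utilization(util: float) -> float:
    """Reward shaping from the question: r_util = exp(util / 1.5) - 1."""
    return math.exp(util / 1.5) - 1.0


# Assumption: util is a mean machine utilization in [0, 1].
utils = [i / 10 for i in range(11)]
rewards = [reward_from_utilization(u) for u in utils]

for u, r in zip(utils, rewards):
    print(f"util={u:.1f} -> reward={r:.3f}")
```

Under that assumption the reward is zero at util = 0, grows strictly with util, and is never negative, so on rewarded steps a higher reward should correspond to a higher utilization.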

The ep_rew_mean diagram from tensorboard: [image: ep_rew_mean diagram from tensorboard]

The losses from tensorboard. It seems to learn at least something, although I am not sure whether it learns the wrong thing. [images: loss and policy gradient loss from tensorboard; value loss from tensorboard]

The step function, which calls the reward calculation, is:

def step(self, actions):
        reward = None
        terminal = False
        states = None
        truncated = False
        info = {}
        self.step_counter += 1

        # print(self.counter, "Agent-Action: ", int(actions))

        if (self.step_counter % self.parameters['EXPORT_FREQUENCY'] == 0 or self.step_counter % self.max_episode_timesteps == 0) \
                and not self.parameters['EXPORT_NO_LOGS']:
            self.export_statistics(self.step_counter, self.count_episode)

        if self.step_counter == self.max_episode_timesteps:
            print("Last episode action ", datetime.now())
            truncated = True

        # If multiple transport agents then for loop required
        for agent in Transport.agents_waiting_for_action:
            agent = Transport.agents_waiting_for_action.pop(0)

            if self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'direct':
                agent.next_action = [int(actions)]
            elif self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'resource':
                agent.next_action = [int(actions[0]), int(actions[1])]
            agent.state_before = None

            self.parameters['continue_criteria'].succeed()
            self.parameters['continue_criteria'] = self.env.event()

            self.env.run(until=self.parameters['step_criteria'])  # Waiting until action is processed in simulation environment
            # Simulation is now in state after action processing

            reward, terminal = agent.calculate_reward(actions)

            if terminal:
                print("Last episode action ", datetime.now())
                self.export_statistics(self.step_counter, self.count_episode)

            agent = Transport.agents_waiting_for_action[0]
            states = agent.calculate_state()  # Calculate state for next action determination

            if self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'direct':
                self.statistics['stat_agent_reward'][-1][3] = [int(actions)]
            elif self.parameters['TRANSP_AGENT_ACTION_MAPPING'] == 'resource':
                self.statistics['stat_agent_reward'][-1][3] = [int(actions[0]), int(actions[1])]
            self.statistics['stat_agent_reward'][-1][4] = round(reward, 5)
            self.statistics['stat_agent_reward'][-1][5] = agent.next_action_valid
            self.statistics['stat_agent_reward'].append([self.count_episode, self.step_counter, round(self.env.now, 5),
                                                         None, None, None, states])

            # done = truncated or terminal
            #if truncated:
                #self.reset()

            return states, reward, terminal, truncated, info

The reward function is calculated like this:

def calculate_reward(self, action):
        result_reward = self.parameters['TRANSP_AGENT_REWARD_INVALID_ACTION']  # = 0.0
        result_terminal = False
        if self.invalid_counter < self.parameters['TRANSP_AGENT_MAX_INVALID_ACTIONS']:  # If true, then invalid action selected
            if self.parameters['TRANSP_AGENT_REWARD'] == "valid_action":
                result_reward = get_reward_valid_action(self, result_reward)
            elif self.parameters['TRANSP_AGENT_REWARD'] == "utilization":
                result_reward = get_reward_utilization(self, result_reward)
        else:
            self.invalid_counter = 0
            result_reward = 0.0
            # result_terminal = True

        if self.next_action_valid:
            self.invalid_counter = 0
            self.counter_action_subsets[0] += 1
            if self.next_action_destination != -1 and self.next_action_origin != -1 and self.next_action_destination.type == 'machine':
                self.counter_action_subsets[1] += 1
            elif self.next_action_destination != -1 and self.next_action_origin != -1 and self.next_action_destination.type == 'sink':
                self.counter_action_subsets[2] += 1
        # If explicit episode limits are set in configuration
        if self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT'] > 0:
            result_reward = 0.0
            if (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'valid' and self.counter_action_subsets[0] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'entry' and self.counter_action_subsets[1] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'exit' and self.counter_action_subsets[2] == self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']) or \
                (self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT_TYPE'] == 'time' and self.env.now - self.last_reward_calc_time > self.parameters['TRANSP_AGENT_REWARD_EPISODE_LIMIT']):
                result_terminal = True
                self.last_reward_calc_time = self.env.now
                self.invalid_counter = 0
                self.counter_action_subsets = [0, 0, 0]
            if result_terminal:
                if self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "utilization":
                    result_reward = get_reward_sparse_utilization(self)
                elif self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "waiting_time":
                    result_reward = get_reward_sparse_waiting_time(self)
                elif self.parameters['TRANSP_AGENT_REWARD_SPARSE'] == "valid_action":
                    result_reward = get_reward_sparse_valid_action(self)
        else:
            self.last_reward_calc_time = self.env.now
        self.latest_reward = result_reward
        return result_reward, result_terminal

def get_reward_utilization(transport_resource, invalid_reward):
    result_reward = invalid_reward
    if transport_resource.next_action_destination == -1 or transport_resource.next_action_origin == -1:  # Waiting or empty action selected
        result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_WAITING_ACTION'] # = 0.0
    elif transport_resource.next_action_valid:
        util = 0.0
        for mach in transport_resource.resources['machines']:
            util += mach.get_utilization_step() # calculation of utilization of machines
        util = util / transport_resource.parameters['NUM_MACHINES']
        transport_resource.last_reward_calc = util
        result_reward = np.exp(util / 1.5) - 1.0
        if transport_resource.next_action_destination.type == 'machine':
            result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_SUBSET_WEIGHTS'][0] * result_reward  # here the weight is = 1.0
        else:
            result_reward = transport_resource.parameters['TRANSP_AGENT_REWARD_SUBSET_WEIGHTS'][1] * result_reward # here the weight is = 1.0
    return result_reward

The reset function looks like this:

def reset(self):
        print("####### Reset Environment #######")

        self.count_episode += 1
        self.step_counter = 0

        if self.count_episode == self.parameters['CHANGE_SCENARIO_AFTER_EPISODES']:
            self.change_production_parameters()

        print("Sim start time: ", self.statistics['sim_start_time'])

        # Setup and start simulation
        if self.env.now == 0.0:
            print('Run machine shop simpy environment')
            self.env.run(until=self.parameters['step_criteria'])

        obs = np.array(self.resources['transps'][0].calculate_state())
        info = {}
        return obs, info

I already checked the reward function, and as far as I understand, it works as I expect. I also checked that the reward reported in tensorboard matches the reward in my logging files. I read the post "Why does ep_re_mean decrease over time?", but it did not help me. Does anyone have an idea why the mean reward per episode decreases over time? Note: I can provide more code if needed. Thanks in advance!
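One way to compare the logged rewards per episode is sketched below. It assumes each logged row has the layout [episode, step, sim_time, action, reward, action_valid, state], which is inferred from how step() above fills self.statistics['stat_agent_reward']; the function name summarize_episodes and the example rows are made up for illustration.

```python
from collections import defaultdict


def summarize_episodes(rows):
    """Aggregate per-episode return, step count, and share of valid actions.

    Assumed row layout (inferred from step() in the question):
    [episode, step, sim_time, action, reward, action_valid, state]
    """
    totals = defaultdict(lambda: {"ret": 0.0, "steps": 0, "valid": 0})
    for row in rows:
        episode, reward, valid = row[0], row[4], row[5]
        if reward is None:  # placeholder row appended before the action is scored
            continue
        t = totals[episode]
        t["ret"] += reward
        t["steps"] += 1
        t["valid"] += int(bool(valid))
    return {
        ep: {
            "return": t["ret"],
            "steps": t["steps"],
            "mean_reward": t["ret"] / t["steps"],
            "valid_share": t["valid"] / t["steps"],
        }
        for ep, t in totals.items()
    }


# Made-up rows, only to show the expected shape of the input.
example_rows = [
    [1, 1, 0.5, [3], 0.40, True, None],
    [1, 2, 1.0, [0], 0.00, False, None],
    [2, 1, 1.5, [2], 0.50, True, None],
    [2, 2, 2.0, None, None, None, None],
]
summary = summarize_episodes(example_rows)
print(summary)
```

If ep_rew_mean is the mean of summed episode rewards, as in Stable-Baselines3 logging, then comparing return, steps and valid_share across episodes would indicate whether a drop comes from lower per-step rewards or from more zero-reward (waiting or invalid) actions.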

EDIT: My full code can be found here: JSP_Environment


Solution

  • The decreasing mean episodic reward is caused by the way I designed the observation space. For example, if I provide only the total processing time as the observation instead of 13 different observations, the average episodic reward increases. If I use all 13 observations, the average episodic reward decreases. Hence the design of the state space causes the problem.
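A minimal sketch of how such a comparison between observation subsets could be set up without touching the environment itself is shown below. It assumes observations are flat sequences and that the environment follows the reset() -> (obs, info) and step() -> 5-tuple conventions; FeatureSubsetEnv, DummyEnv and the chosen index are hypothetical and not taken from my repository.

```python
class FeatureSubsetEnv:
    """Expose only selected observation features of a Gymnasium-style env.

    Assumption: observations are flat sequences, and the wrapped env follows
    the reset() -> (obs, info) and step() -> 5-tuple conventions.
    """

    def __init__(self, env, keep_indices):
        self.env = env
        self.keep_indices = list(keep_indices)

    def _select(self, obs):
        # Keep only the requested features, in the requested order.
        return [obs[i] for i in self.keep_indices]

    def reset(self, **kwargs):
        obs, info = self.env.reset(**kwargs)
        return self._select(obs), info

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        return self._select(obs), reward, terminated, truncated, info


class DummyEnv:
    """Stand-in env with a 13-feature observation, for illustration only."""

    def reset(self):
        return list(range(13)), {}

    def step(self, action):
        return list(range(13)), 1.0, False, False, {}


# Hypothetical choice: pretend index 4 holds the total processing time.
env = FeatureSubsetEnv(DummyEnv(), keep_indices=[4])
obs, info = env.reset()
next_obs, reward, terminated, truncated, _ = env.step(0)
print(obs, next_obs)
```

With a real Gymnasium environment, subclassing gymnasium.ObservationWrapper and adjusting observation_space to the reduced shape would probably be the more idiomatic route; the plain class above only illustrates the idea of training on one feature subset at a time to see which observations hurt learning.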