
Structuring reward function for Open AI RL environment for raw material purchasing


I am in the process of experimenting with deep reinforcement learning and have created the following environment, in which I run a simulation of purchasing a raw material. start_qty is the amount of material I have on hand when I begin purchasing for the next 12 weeks (sim_weeks). I have to purchase in multiples of 195000 pounds and expect to use 45000 pounds of material per week.
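For example, starting with 100000 pounds on hand and buying one multiple in the first week leaves 100000 - 45000 + 195000 = 250000 pounds at the end of that week, or about 250000 / 45000 ≈ 5.6 weeks (roughly 39 days) of supply.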

import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000
sim_weeks = 12
purchase_mult = 195000
#days on hand cost =
forecast_qty = 45000


class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: buy 0 or buy 1 multiple of purchase_mult
        self.action_space = Discrete(2)
        # Observation: quantity of material on hand, in pounds
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length
        self.purchase_length = sim_weeks
        #self.current_step = 1
        
    def step(self, action):
        # Apply action
        # Consume this week's forecast usage; this gives us the
        # quantity available at the end of the week
        self.state -= forecast_qty

        # See if we need to buy (action 1 buys one multiple of purchase_mult)
        self.state += action * purchase_mult
       
        
        # Now calculate the days on hand from this: weeks of supply * 7
        days = self.state / forecast_qty * 7
        
        
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1 
        #self.current_step+=1
        
        # Calculate reward: reward is the negative of days_on_hand
        if self.state<0:
            reward = -10000
        else:
            reward = -days
        
        # Check if the purchasing window is done
        if self.purchase_length <= 0: 
            done = True
        else:
            done = False
        
        # Set placeholder for info
        info = {}
        
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Implement viz
        pass
    
    def reset(self):
        # Reset qty
        self.state = start_qty
        self.purchase_length = sim_weeks
        
        return self.state

I am debating whether the reward function is sufficient. What I am attempting to do is minimize the sum of the days on hand across all steps, where the days on hand for a given step is defined by days in the code. Since the goal is to maximize the reward, I decided I could negate the days-on-hand value and use that negative number as the reward (so maximizing the reward minimizes the days on hand). I then added the strong penalty for letting the quantity available in any given week go negative.
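As a quick sanity check of the reward scale (a minimal sketch, assuming the ResinEnv class above; the random policy is only for illustration):

env = ResinEnv()
obs = env.reset()
done = False
total_reward = 0
while not done:
    action = env.action_space.sample()  # 0 = don't buy, 1 = buy one multiple
    obs, reward, done, info = env.step(action)
    total_reward += reward
    print(f"qty on hand: {obs}, reward: {reward:.2f}")
print(f"episode reward: {total_reward:.2f}")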

Is there a better way to do this? I am new to this subject and also new to Python in general, so any advice is greatly appreciated!


Solution

  • I think you should consider reducing the scale of the rewards. Check here and here for advice on stabilising the training of neural networks. If the only task for the RL agent is to minimise the sum of days on hand, then the reward system makes sense. It just needs a bit of normalisation!
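For instance, one way to do that normalisation (a sketch only; max_days, scaled_reward, and the -1.0 stockout penalty are illustrative choices, not part of the original code) is to divide the per-step days on hand by an upper bound so the reward stays roughly in [-1, 0], and to replace the huge -10000 penalty with a bounded one:

# Assumed upper bound on days on hand: everything purchasable bought up front.
max_days = (start_qty + sim_weeks * purchase_mult) / forecast_qty * 7

def scaled_reward(qty_on_hand):
    # Per-step reward roughly in [-1, 0] instead of raw negative days.
    if qty_on_hand < 0:
        return -1.0  # bounded stockout penalty instead of -10000
    days = qty_on_hand / forecast_qty * 7
    return -days / max_days

Keeping the stockout penalty on the same order of magnitude as the other rewards avoids one huge outlier dominating the value estimates, which is usually what destabilises training.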