I am experimenting with deep reinforcement learning and have created the following environment, which simulates purchasing a raw material. start_qty is the amount of material I have on hand when looking to purchase for the next 12 weeks (sim_weeks). I have to purchase in multiples of 195000 pounds and expect to use 45000 pounds of material per week.
import numpy as np
from gym import Env
from gym.spaces import Discrete, Box

start_qty = 100000      # pounds on hand at the start of the horizon
sim_weeks = 12          # planning horizon in weeks
purchase_mult = 195000  # purchases must be in multiples of this many pounds
# days on hand cost =
forecast_qty = 45000    # expected usage per week, in pounds

class ResinEnv(Env):
    def __init__(self):
        # Actions we can take: 0 = buy nothing, 1 = buy one purchase multiple
        self.action_space = Discrete(2)
        # Observation is the quantity on hand; negative values are allowed
        # so that stockouts remain observable
        self.observation_space = Box(low=np.array([-1000000]), high=np.array([1000000]))
        # Set start qty
        self.state = start_qty
        # Set purchase length (weeks remaining)
        self.purchase_length = sim_weeks

    def step(self, action):
        # Apply the week's consumption: this gives us qty available at the end of the week
        self.state -= forecast_qty
        # Apply the purchase decision
        self.state += action * purchase_mult
        # Days on hand = quantity divided by daily usage (forecast_qty / 7)
        days = self.state / (forecast_qty / 7)
        # Reduce weeks left to purchase by 1 week
        self.purchase_length -= 1
        # Reward is the negative of days on hand, with a large penalty for stockouts
        if self.state < 0:
            reward = -10000
        else:
            reward = -days
        # The episode is done once the purchasing horizon is exhausted
        done = self.purchase_length <= 0
        # Set placeholder for info
        info = {}
        # Return step information
        return self.state, reward, done, info

    def render(self):
        # Visualisation not implemented
        pass

    def reset(self):
        # Restore the starting quantity and horizon
        self.state = start_qty
        self.purchase_length = sim_weeks
        return self.state
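For reference, here is a minimal sanity check, assuming the standard gym step/reset loop: it runs one episode with random actions and prints the total reward, which is handy for confirming the environment terminates after 12 steps before training an agent on it.

env = ResinEnv()
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # random policy, just for testing
    obs, reward, done, info = env.step(action)
    total_reward += reward
print(f"Episode finished, total reward: {total_reward:.2f}")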
I am debating whether the reward function is sufficient. What I am attempting to do is minimize the sum of the days on hand across all steps, where the days on hand for a given step is defined by days in the code. Since the goal is to maximize the reward, I convert the days-on-hand value to a negative number and use that as the reward (so maximizing the reward minimizes the days on hand). I then added a strong penalty for letting the quantity available go negative in any given week.
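For concreteness, using the days-on-hand formula in the code above: with 55000 pounds on hand at the end of a week, days = 55000 / (45000 / 7) ≈ 8.6, so the step reward would be about -8.6.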
Is there a better way to do this? I am new to this subject and also new to Python in general. Any advice is greatly appreciated!
I think you should consider reducing the scale of the rewards. Check here and here for advice on stabilising training in neural networks. If the agent's only task is to minimise the sum of days on hand, then your reward scheme makes sense; it just needs a bit of normalisation!
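As a minimal sketch of what that normalisation could look like: the idea is to divide by a worst-case days-on-hand bound so rewards land in roughly (-1, 0]. The max_days bound below is a hypothetical choice (buy every week of the horizon and never consume a stockout), not the only reasonable one.

# Hypothetical upper bound on days on hand: start quantity plus buying
# every single week of the horizon, measured in days of usage.
max_days = (start_qty + sim_weeks * purchase_mult) / (forecast_qty / 7)

def normalised_reward(state):
    # Stockout: return the worst possible reward on the same unit scale,
    # instead of an enormous raw penalty like -10000.
    if state < 0:
        return -1.0
    days_on_hand = state / (forecast_qty / 7)
    # Scale into (-1, 0] so per-step rewards stay well-behaved for the network.
    return -days_on_hand / max_days

One consequence of this scheme is that the stockout penalty collapses to -1.0, the same order of magnitude as the per-step reward; if stockouts must be discouraged much more strongly, a moderately larger constant such as -10 is usually a safer choice than -10000, which dwarfs every other reward signal.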