I am interested in implementing Q-learning (or some form of reinforcement learning) to find an optimal protocol. Currently, I have a function written in Python that takes in a protocol or "action" and a "state" and returns a new state and a "reward". However, I am having trouble finding a Python implementation of Q-learning that I can use in this situation (i.e. something that can learn the function as if it were a black box). I have looked at OpenAI Gym, but that would require writing a new environment. Would anyone know of a simpler package or script that I can adapt for this?
My code is of the form:
def myModel(state, action, param1, param2):
    ...
    return (state, reward)
What I am looking for would be an algorithm of the form:
def QLearning(state, reward):
    ...
    return action
And some way of keeping track of the actions that transition between states. If anyone has any idea where to look for this, I would be very excited!
A lot of the comments presented here assume deep knowledge of reinforcement learning. It seems that you are just getting started, so I would recommend beginning with the most basic Q-learning algorithm.
The best way to learn RL is to code the basic algorithm yourself. The setup has two parts (model and agent), and it looks like this:
def model(state, action):
    ...
    return s2, reward, done
where s2 is the new state the model enters after performing action a, reward is the reward for taking that action in that state, and done simply indicates whether the episode has ended. It seems like you have this part already.
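For concreteness, here is a minimal sketch of what such a model might look like. The environment (a toy chain where the agent moves left or right toward a goal state) and every name in it are hypothetical, just to illustrate the interface:

N_STATES = 5  # hypothetical toy chain: states 0..4, with state 4 as the goal

def model(state, action):
    # action 0 moves left, action 1 moves right; clamp to the ends of the chain
    s2 = min(max(state + (1 if action == 1 else -1), 0), N_STATES - 1)
    reward = 1.0 if s2 == N_STATES - 1 else 0.0   # reward only when the goal is reached
    done = (s2 == N_STATES - 1)                   # episode ends at the goal
    return s2, reward, done

In your case, this is just a thin wrapper around your existing myModel function.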
The next part is the agent and looks like this:
import numpy as np

states = [s1, s2, s3, ...]   # enumerate your states here
actions = [a1, a2, a3, ...]  # enumerate your actions here

Q_matrix = np.zeros([len(states), len(actions)])  # one row per state, one column per action
discount = 0.95       # how much future rewards are valued
learning_rate = 0.1   # step size of each update
action_list = []      # records the sequence of actions taken

def q_learning_action(s, Q_matrix):
    # Pick the action with the highest Q-value in state s (greedy policy)
    action = np.argmax(Q_matrix[s, :])
    action_list.append(action)  # record your action as requested
    return action

def q_learning_updating(s, a, reward, s2, Q_matrix):
    # Standard Q-learning update rule
    Q_matrix[s, a] = (1 - learning_rate) * Q_matrix[s, a] + \
        learning_rate * (reward + discount * np.max(Q_matrix[s2, :]))
    s = s2
    return s, Q_matrix
With this, you can build an RL agent to learn many basic things for optimal control.
Basically, q_learning_action gives you the action to perform on the environment. Then, using that action, you call the model to get the next state and reward. Then, using all that information, you update your Q-matrix with the new knowledge, as in the sketch below.
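A rough training loop that ties the model and the agent together might look like this (the number of episodes and the starting state s0 are assumptions you would adapt to your own problem):

for episode in range(1000):  # number of episodes is arbitrary here
    s = s0                   # s0: your chosen starting state
    done = False
    while not done:
        a = q_learning_action(s, Q_matrix)     # agent picks an action
        s2, reward, done = model(s, a)         # model returns next state, reward, done
        s, Q_matrix = q_learning_updating(s, a, reward, s2, Q_matrix)  # update Q and move on

Note that a purely greedy action choice never explores, so in practice you would usually mix in some exploration (e.g. picking a random action with small probability, known as epsilon-greedy), but that goes beyond this basic sketch.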
Let me know if anything doesn't make sense!