python machine-learning reinforcement-learning

Performance issue with gradient-bandit agent

I am reading Sutton & Barto's "Reinforcement Learning: An Introduction" and trying to test the gradient-bandit agent (chapter 2.7), but its performance is extremely low. I've tried:

using baseline = average reward, not using baseline;
alpha = 0.1, 0.2, 0.3, 0.4;
initial preferences H = 0, 1, 10, 100.

Nothing helps.

This is my Python code for one step of an agent's life, i.e. action selection followed by a parameter update (self is the agent):

# for probabilities calculation:
pref_exps = np.exp(self.params["preferences"])
pref_exps_sum = sum(pref_exps)

# choosing a bandit:
choice_dice = np.random.uniform() * pref_exps_sum
accum_pref_exp = 0
for i, pref_exp in enumerate(pref_exps):
    accum_pref_exp += pref_exp
    if accum_pref_exp >= choice_dice:
        self.chosen_bandit_i = i
        break

# self.reward is filled here:
self.perform_bandit(self.chosen_bandit_i)

# updating baseline:
self.params["lifetime"] += 1
self.params["average_reward"] += 1 / self.params["lifetime"] * (self.reward - self.params["average_reward"])

# updating preferences:
for i, pref_exp in enumerate(pref_exps):
    probability = pref_exp / pref_exps_sum
    if i == self.chosen_bandit_i:
        self.params["preferences"][i] += self.params["alpha"] * (self.reward - self.params["average_reward"]) * (1 - probability)
    else:
        self.params["preferences"][i] -= self.params["alpha"] * (self.reward - self.params["average_reward"]) * probability

This code leads to extremely poor performance (100 agents, each accessing its own 10 one-armed bandits, tested over 2000 steps), as the lower plot shows:

I've seen this post, and its code seems to be effectively equivalent to mine once the error that prompted that post is fixed. But unlike my code, that post's code works properly once corrected!

I can't figure out where I've made a mistake. Can you help me use the full power of the gradient-bandit agent properly?

Solution

Solved! I am sorry, the problem was outside the code shown above: the check for whether the agent chose the best bandit was coded incorrectly!
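
As a sanity check that the update rule itself is sound, here is a minimal, self-contained sketch of the same step in vectorized form. It is not my original agent class: the names softmax and run_gradient_bandit, the Gaussian rewards with unit variance, and the fixed seed are all assumptions made for illustration. The only deliberate difference from the loop version is that the maximum preference is subtracted before exponentiating, which leaves the probabilities unchanged but avoids overflow for large H.

```python
import numpy as np


def softmax(preferences):
    # Subtracting the max does not change the result but prevents overflow in exp.
    z = np.exp(preferences - np.max(preferences))
    return z / z.sum()


def run_gradient_bandit(true_means, alpha=0.1, steps=2000, seed=0):
    """Hypothetical test harness: one agent, Gaussian rewards with unit variance."""
    rng = np.random.default_rng(seed)
    n = len(true_means)
    preferences = np.zeros(n)
    average_reward = 0.0
    best = int(np.argmax(true_means))
    optimal_picks = 0

    for t in range(1, steps + 1):
        # Action selection: sample a bandit according to the softmax probabilities.
        probs = softmax(preferences)
        action = int(rng.choice(n, p=probs))
        reward = rng.normal(true_means[action], 1.0)

        # Baseline: incremental average of all rewards seen so far.
        average_reward += (reward - average_reward) / t

        # Preference update: +(1 - p) for the chosen bandit, -p for the others.
        one_hot = np.zeros(n)
        one_hot[action] = 1.0
        preferences += alpha * (reward - average_reward) * (one_hot - probs)

        # Compare plain Python ints, so the result is a single bool.
        optimal_picks += int(action == best)

    return preferences, optimal_picks / steps
```

Calling run_gradient_bandit(np.array([-1.0, 0.0, 2.0])) returns the final preferences and the fraction of steps on which the best bandit was chosen. Note that the correctness check compares two plain ints, which is exactly the part that went wrong in my real code.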

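An interesting thing: np.random.choice(a_list) returns a NumPy scalar (a value of some numpy type, e.g. numpy.int64), not a plain Python value! And when you compare this value to another list, NumPy broadcasts it and compares the two element-wise as arrays, so the result is an array of booleans rather than a single True or False!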
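
A small sketch of that behaviour, for anyone who runs into the same thing. The variable names and the list contents are hypothetical and are not taken from my actual code; they only illustrate the comparison.

```python
import numpy as np

best_bandits = [3, 7]                # hypothetical list of best bandit indices
chosen = np.random.choice([3, 3])    # always 3, but returned as a NumPy scalar

# NumPy scalar vs. list: the list is converted to an array and compared element-wise.
comparison = chosen == best_bandits  # array([ True, False])

# Plain Python int vs. list: an ordinary comparison that yields one bool.
plain_comparison = 3 == best_bandits  # False

# What was probably intended: a membership test on a plain int.
is_best = int(chosen) in best_bandits  # True
```

Converting the result of np.random.choice with int(...) (or calling .item() on it) before comparing keeps the check an ordinary Python comparison.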

That was something I didn't know about or pay attention to, which is why the actual error in my code went unnoticed.