python, reinforcement-learning

What does non-stationarity mean and how to implement it in reinforcement learning as 10 arm bandit problem?


I have started learning reinforcement learning and am referring to the book by Sutton and Barto. I was trying to understand the non-stationary environment, which is described in the book as:

suppose the bandit task were nonstationary, that is, that the true values of the actions changed over time. In this case exploration is needed even in the deterministic case to make sure one of the nongreedy actions has not changed to become better than the greedy one.

This tells me that the true expected reward of each action changes over time. But does this mean it changes at every time step? I understand how to track the rewards in such a case, namely by weighting recent rewards more heavily than older ones at every time step. However, does it also mean that the targets, i.e. the true values, change at every time step? I am trying to simulate the 10-armed bandit problem of Figure 2.3 of the book, which compares Upper-Confidence-Bound (UCB) action selection with the epsilon-greedy method, using the sample-average method to estimate action values in a stationary environment.
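For concreteness, here is a tiny sketch (my own, not from the book's code) of what I mean by weighting recent rewards more: with a constant step size alpha, the estimate is an exponential recency-weighted average of the observed rewards.

import numpy as np

# The constant step-size update Q <- Q + alpha * (R - Q) gives reward R_i
# the weight alpha * (1 - alpha)^(n - i), so recent rewards count more.
alpha = 0.1
rewards = [1.0, 2.0, 3.0, 4.0]

q = 0.0
for r in rewards:
    q += alpha * (r - q)

n = len(rewards)
weights = [alpha * (1 - alpha) ** (n - 1 - i) for i in range(n)]
q_direct = sum(w * r for w, r in zip(weights, rewards))  # the Q_1 = 0 term vanishes

print(q, q_direct)  # both print roughly 0.905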

If I have to simulate the same experiment with a non-stationary environment, how can I do that? Below is my code:

import numpy as np
from tqdm import trange


class NArmedBandit:

    # 10-armed bandit testbed; action values estimated by sample averages
    # or by a constant step size (see sample_avg_flag and step_size)
    def __init__(self,k=10,step_size = 0.1,eps = 0,UCB_c = None, sample_avg_flag = False,
                 init_estimates = 0.0,mu = 0, std_dev = 1):
        self.k = k
        self.step_size = step_size
        self.eps = eps
        self.init_estimates = init_estimates
        self.mu = mu
        self.std_dev = std_dev
        self.actions = np.zeros(k)
        self.true_reward = 0.0
        self.UCB_c = UCB_c
        self.sample_avg_flag = sample_avg_flag
        self.re_init()


    def re_init(self):

        #true values of rewards for each action
        self.actions = np.random.normal(self.mu,self.std_dev,self.k) 

        # estimation for each action
        self.Q_t = np.zeros(self.k) + self.init_estimates

        # num of chosen times for each action
        self.N_t = np.zeros(self.k)

        #best action chosen
        self.optim_action = np.argmax(self.actions)

        self.time_step = 0


    def act(self):
        # pick an action: explore with probability eps, otherwise use UCB if
        # enabled, otherwise act greedily on the current estimates Q_t
        val = np.random.rand()
        if val < self.eps:
            action = np.random.choice(np.arange(self.k))
            #print('action 1:',action)
        elif self.UCB_c is not None:
            #1e-5 is added so as to avoid division by zero
            ucb_estimates = self.Q_t + self.UCB_c * np.sqrt(np.log(self.time_step + 1) / (self.N_t + 1e-5))
            A_t = np.max(ucb_estimates)
            action = np.random.choice(np.where(ucb_estimates == A_t)[0])
        else:
            A_t = np.max(self.Q_t)
            action = np.random.choice(np.where(self.Q_t == A_t)[0])
            #print('action 2:',action)
        return action



    def step(self,action):

        # generating the reward under N(real reward, 1)
        reward = np.random.randn() + self.actions[action]
        self.time_step += 1
        self.N_t[action] += 1


        # estimate with sample averages (suited to stationary problems)
        if self.sample_avg_flag:
            self.Q_t[action] += (reward - self.Q_t[action]) / self.N_t[action]
        else:
            # constant step size: an exponential recency-weighted average,
            # suited to non-stationary problems
            self.Q_t[action] += self.step_size * (reward - self.Q_t[action])

        return reward


    def play(self,tasks,num_time_steps):
        rewards = np.zeros((tasks, num_time_steps))
        optim_action_counts = np.zeros(rewards.shape)
        for task in trange(tasks):
            self.re_init()
            for t in range(num_time_steps):
                action = self.act()
                reward = self.step(action)
                rewards[task, t] = reward
                if action == self.optim_action:
                    optim_action_counts[task, t] = 1
        avg_optim_action_counts = optim_action_counts.mean(axis=0)
        avg_rewards = rewards.mean(axis=0)
        return avg_optim_action_counts, avg_rewards

Should I change the actions array (which holds the true action values) defined in re_init() by calling re_init() after every time step in play(), which would amount to changing the true expected reward of every action at each time step? I have already incorporated the update rule for the non-stationary case in the step() function, which uses a constant step size alpha = 0.1. The only thing I don't know is how to set up or simulate the non-stationary environment here, and whether I have understood it correctly.


Solution

  • Your understanding of non-stationarity is correct: "the true values of the actions changed over time."

    But how do they change?

    Actually, how they change is not precisely defined, and your re_init() approach is correct from my perspective; what you need to decide is when the true values change. One thing is clear, though: if you completely re-randomize the true rewards at every time step, there is nothing to learn, because everything the agent is trying to estimate changes at every step. I can offer two ways to satisfy the non-stationarity definition (a sketch of both follows the list).

    1. Call re_init() with a small probability (e.g. eps) every 100 or 1000 steps, so the true values are re-drawn only occasionally.

    2. Start from the initial true values and add small random positive or negative increments to them at every step, so the true rewards drift gradually upward or downward (a random walk).
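    Here is a minimal sketch of both options, building on your class. The names NonStationaryBandit, drift_std and reset_prob are mine, and the N(0, 0.01) drift step is just an illustration. Note that, unlike calling re_init(), this only changes the true values and leaves your estimates Q_t and N_t intact, so the agent actually has to track the change:

import numpy as np

class NonStationaryBandit(NArmedBandit):
    # Sketch only: the class name, drift_std and reset_prob are made-up names.
    def __init__(self, drift_std=0.01, reset_prob=None, **kwargs):
        self.drift_std = drift_std    # option 2: std-dev of the per-step random walk
        self.reset_prob = reset_prob  # option 1: chance of re-drawing all true values
        super().__init__(**kwargs)

    def step(self, action):
        reward = super().step(action)

        # Option 2: every true action value takes a small N(0, drift_std) step,
        # so the rewards drift slowly instead of jumping around.
        if self.drift_std is not None:
            self.actions += np.random.normal(0.0, self.drift_std, self.k)

        # Option 1: occasionally re-draw all true action values from scratch.
        if self.reset_prob is not None and np.random.rand() < self.reset_prob:
            self.actions = np.random.normal(self.mu, self.std_dev, self.k)

        # The best arm may have changed, so recompute it for the optimal-action count.
        self.optim_action = np.argmax(self.actions)
        return reward

    You could then run, for example, NonStationaryBandit(eps=0.1, sample_avg_flag=True).play(2000, 10000) and the same with sample_avg_flag=False, and compare how the two estimators cope once the true values start moving.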