I am implementing REINFORCE for the CartPole-v0 OpenAI Gym environment. I am trying two different implementations of it, and the issue I am not able to resolve is the following:
When I pass a single state to the policy network, I get an output tensor of size 2 containing the probabilities of the two actions. However, when I pass a batch of states to the policy network to compute the action probabilities for all of them, the values I obtain are very different from the ones I get when each state is passed individually.
Can someone help me understand the issue?
My code is below. (Note: this is NOT the complete REINFORCE algorithm -- I am aware that I still need to compute the loss from the probabilities. But before proceeding, I am trying to understand the difference between the two sets of probabilities, which I think should be identical; a minimal reproduction, independent of the rollout, follows the main function below.)
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim

# architecture of the Policy Network
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.n_actions = n_actions
        self.model = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
            nn.Softmax(dim=0)
        ).float()

    def forward(self, X):
        return self.model(X)
def train_reinforce_agent(env, episode_length, max_episodes, gamma, visualize_step, learning_rate=0.003):
    # define the parametric model for the Policy: this is an instantiation of the PolicyNetwork class
    model = PolicyNetwork(env.observation_space.shape[0], env.action_space.n)
    # define the optimizer for updating the weights of the Policy Network
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # hyperparameters of the REINFORCE agent
    EPISODE_LENGTH = episode_length
    MAX_EPISODES = max_episodes
    GAMMA = gamma
    VISUALIZE_STEP = max(1, visualize_step)
    score = []

    for episode in range(MAX_EPISODES):
        # reset the environment
        curr_state = env.reset()
        done = False
        episode_t = []
        # roll out an entire episode from the Policy Network
        pred_vals = []
        for t in range(EPISODE_LENGTH):
            act_prob = model(torch.from_numpy(curr_state).float())
            pred_vals.append(act_prob)
            action = np.random.choice(np.array(list(range(env.action_space.n))), p=act_prob.data.numpy())
            prev_state = curr_state
            curr_state, _, done, info = env.step(action)
            episode_t.append((prev_state, action, t + 1))
            if done:
                break

        score.append(len(episode_t))
        # reward_batch = torch.Tensor([r for (s, a, r) in episode_t]).flip(dims=(0,))
        reward_batch = torch.Tensor([r for (s, a, r) in episode_t])

        # compute the return for every state-action pair from the rewards at every time-step
        batch_Gvals = []
        for i in range(len(episode_t)):
            new_Gval = 0
            power = 0
            for j in range(i, len(episode_t)):
                new_Gval = new_Gval + ((GAMMA ** power) * reward_batch[j]).numpy()
                power += 1
            batch_Gvals.append(new_Gval)

        # normalize the returns for the batch
        expected_returns_batch = torch.FloatTensor(batch_Gvals)
        if torch.is_nonzero(expected_returns_batch.max()):
            expected_returns_batch /= expected_returns_batch.max()

        # batch the states, actions, prob after the episode
        state_batch = torch.Tensor([s for (s, a, r) in episode_t])
        print("State batch:", state_batch)
        all_states = [s for (s, a, r) in episode_t]
        print("All states:", all_states)
        action_batch = torch.Tensor([a for (s, a, r) in episode_t])

        pred_batch_v1 = model(state_batch)
        pred_batch_v2 = torch.stack(pred_vals)
        print("Batched state pred_vals:", pred_batch_v1)
        print("Individual state pred_vals:", pred_batch_v2)  # Why is this different from the above predicted values??
My main function where I pass the environment is:
def main():
    env = gym.make('CartPole-v0')
    # train a REINFORCE agent to learn the optimal policy
    episode_length = 500
    n_episodes = 500
    gamma = 0.99
    vis_steps = 50
    train_reinforce_agent(env, episode_length, n_episodes, gamma, vis_steps)


if __name__ == '__main__':
    main()
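To isolate the mismatch from the gym rollout, I can reproduce it with just the network and random inputs (a minimal sketch that reuses the PolicyNetwork class above with dummy states; the variable names are only for illustration):

# minimal reproduction, independent of the environment rollout (dummy states)
dummy_model = PolicyNetwork(4, 2)          # CartPole-v0 has 4 state dimensions and 2 actions
dummy_states = torch.randn(3, 4)           # a fake "batch" of 3 states
per_state = torch.stack([dummy_model(s) for s in dummy_states])
batched = dummy_model(dummy_states)
print(torch.allclose(per_state, batched))  # I expect True, but this prints False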
In your policy network, you apply Softmax over dim=0. For a batched input of shape (batch_size, n_actions), this normalizes each action's probability across the batch rather than across the actions of each state, which is why the batched outputs differ from the per-state ones. You want to normalize across actions instead: use nn.Softmax(dim=-1), which selects the action dimension for both a single state (1-D input) and a batch of states (2-D input); dim=1 also works, but only for 2-D batched input.
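A minimal sketch of the fix, assuming the same architecture as in the question (the class name FixedPolicyNetwork is just for illustration); with dim=-1 the softmax is taken over the last dimension, i.e. over actions, for both a single state and a batch of states:

import torch
import torch.nn as nn

class FixedPolicyNetwork(nn.Module):
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
            nn.Softmax(dim=-1)   # normalize over the action dimension (last dim)
        )

    def forward(self, X):
        return self.model(X)

model = FixedPolicyNetwork(4, 2)                   # 4 state dimensions, 2 actions (CartPole-v0)
states = torch.randn(5, 4)                         # dummy batch of 5 states
batched = model(states)                            # shape (5, 2); each row sums to 1
per_state = torch.stack([model(s) for s in states])
print(torch.allclose(batched, per_state))          # True once dim=-1 is used

An alternative is to drop the Softmax layer entirely, return raw logits, and sample with torch.distributions.Categorical(logits=...); that sidesteps the dimension question and is numerically more convenient when you later compute the REINFORCE loss.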