My problem is the following. I have a simple grid world:
https://i.sstatic.net/xrhJw.png
The agent starts at the initial state labeled START, and the goal is to reach the terminal state labeled END. However, the agent has to avoid the barriers labeled X, and before reaching the END state it has to collect all items labeled F. I implemented it using both Q-Learning and Sarsa, and the agent reaches the END state and avoids the barriers (X states). So this part works well.
My question is: how can I make the agent collect all the items (F states) before it reaches the END state? With Q-Learning or Sarsa it avoids the barriers and reaches the END state, but does not collect all the items. Usually one F state is visited, and after that the agent heads to the END state.
Thank you for your help!
You should always make sure that reaching the objective is the most 'attractive' way of interacting with the environment. You want your agent to reach a given objective, and your agent tries to maximize the reward signal, so what you need to do is design a reward function that correctly 'guides' the agent toward the right actions.
In the case you have described, it seems that to collect the most reward the agent should visit one F state and then go to the END state. So try changing the reward function to one that, for example, returns more reward for visiting the F states.
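As a rough illustration of "more reward for visiting the F states", here is a minimal sketch. The cell coordinates, the names `ITEM_CELLS`, `END_CELL`, `ITEM_BONUS` and the reward values are all made up for the example, not taken from your grid. It also assumes the bonus is paid only the first time each item is visited, since otherwise the agent could be tempted to step on and off the same F state repeatedly.

```python
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]

# Hypothetical layout and values; replace with the ones from your grid.
ITEM_CELLS: FrozenSet[Cell] = frozenset({(0, 2), (2, 1)})
END_CELL: Cell = (3, 3)
ITEM_BONUS = 1.0
END_REWARD = 1.0


def item_bonus_reward(next_cell: Cell, already_collected: FrozenSet[Cell]) -> float:
    """Reward for entering next_cell, given the items picked up before this step."""
    # Pay the bonus only for an item that has not been collected yet (assumption).
    if next_cell in ITEM_CELLS and next_cell not in already_collected:
        return ITEM_BONUS
    if next_cell == END_CELL:
        return END_REWARD
    return 0.0
```

With these numbers a path that picks up both items before END earns more in total than one that picks up a single item, but how strongly the agent prefers it still depends on your discount factor and path lengths, so treat the values as a starting point.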
Another reward function I can imagine is one that returns -1 for visiting the END state without having collected the items, 1 for visiting the END state with the items collected, and 0 for visiting every other state (or e.g. -0.02 if you want the objective to be reached as fast as possible).
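That second reward function could be sketched roughly as follows. The coordinates and the names `ITEMS`, `END`, `update_collected` and `terminal_reward` are hypothetical. The sketch assumes the environment keeps track of which items have been collected; it is probably worth making that information part of the state your Q-table is indexed by, since otherwise the agent cannot tell a "good" arrival at END from a "bad" one.

```python
from typing import FrozenSet, Tuple

Cell = Tuple[int, int]

# Hypothetical layout; replace with the coordinates from your grid.
ITEMS: FrozenSet[Cell] = frozenset({(0, 2), (2, 1)})
END: Cell = (3, 3)
STEP_REWARD = -0.02  # use 0.0 if speed does not matter


def update_collected(next_cell: Cell, collected: FrozenSet[Cell]) -> FrozenSet[Cell]:
    """Return the set of collected items after entering next_cell."""
    if next_cell in ITEMS:
        return collected | {next_cell}
    return collected


def terminal_reward(next_cell: Cell, collected: FrozenSet[Cell]) -> float:
    """-1 / +1 at END depending on whether every item was collected, else a step cost."""
    if next_cell == END:
        return 1.0 if collected == ITEMS else -1.0
    return STEP_REWARD
```

In a training loop you would call `update_collected` first and then pass its result to `terminal_reward`, using something like `(cell, collected)` as the state key.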
You can play with the reward function design; my recommendation would be to experiment with it and observe the agent's behaviour. This is usually a really nice way of getting to know and understand both the agent and the environment better.