Tags: reinforcement-learning, q-learning

Can I design a non-deterministic reward function in Q-learning?


In the Q-learning algorithm, the reward function rewards the action taken in the current state. My question is: can I have a non-deterministic reward function that depends on the time at which an action is performed in a state?

For example, suppose the reward for an action taken in a state at 1PM is r(s,a). After several iterations (say, now at 3PM), the system reaches the same state and performs the same action it did at 1PM. Must the reward given at 3PM be the same as the one given at 1PM? Or can the reward function take time into consideration (i.e., the reward for the same state and the same action can differ at different times)?

That is my question. One more thing I want to add: I don't want to treat time as a feature of the state, because in that case no two states could ever be the same (time is always increasing).


Solution

  • My first thought was your last sentence, i.e., to include time as part of the state. As you said, time is always increasing, but it is also cyclical, so your reward function could depend on some repetitive feature of time. For example, every day it is 3PM at some point.

    On the other hand, the reward function can be stochastic; there is no restriction to deterministic functions. However, take into account that the policy will tend to optimize the expected return. Therefore, if your agent obtains a totally different reward each time it visits the same [state, action] pair, there is probably something wrong with how you are modelling your environment.
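    The cyclical-time idea above can be sketched as follows. This is a minimal illustration, not a standard API: the function name and the 24-hour encoding are my own. Mapping the hour of day onto a circle makes "3PM today" and "3PM tomorrow" identical features, so states can repeat even though wall-clock time keeps increasing:

    ```python
    import math

    def time_features(hour_of_day):
        """Encode the hour (0-24) as a point on the unit circle, so 23:00
        and 01:00 come out close together and the feature repeats daily."""
        angle = 2 * math.pi * hour_of_day / 24.0
        return (math.sin(angle), math.cos(angle))
    ```

    These two numbers can then be appended to the rest of the state representation instead of raw time.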
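    To see why a stochastic reward is acceptable as long as its expectation is well defined, here is a minimal tabular Q-learning sketch. The single-state environment, the two actions, and all names are hypothetical, chosen only for illustration: each action pays a noisy reward, yet the Q-values settle near the fixed point implied by the *mean* rewards, so the greedy policy still picks the action with the higher expectation:

    ```python
    import random

    def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
        """Standard tabular Q-learning update; it is unchanged whether the
        reward was drawn from a distribution or is deterministic."""
        best_next = max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])

    random.seed(0)
    # one state (0), two actions (0 and 1), Q-table initialised to zero
    q = {(0, a): 0.0 for a in (0, 1)}
    for _ in range(20000):
        a = random.choice((0, 1))
        # noisy rewards: action 0 has mean 1.0, action 1 has mean 0.5
        mean = 1.0 if a == 0 else 0.5
        r = mean + random.uniform(-0.5, 0.5)
        q_update(q, 0, a, r, 0)
    ```

    Despite the noise, q[(0, 0)] ends up above q[(0, 1)], close to the fixed point E[r]/(1 - gamma) for the better action, which is the "expected return" behaviour described above.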