Tags: python-3.x, tensorflow, tensorflow-agents

What changes occur when using tf_agents.environments.TFPyEnvironment to convert a Python RL environment into a TF environment?


I noticed something strange when converting a Python environment into a TF environment using tf_agents.environments.TFPyEnvironment, and I'd like to ask what changes occur in general.

To clarify the question, please find my code below. I want the environment to simulate (in an oversimplified manner) interactions with customers who want to buy fruits or vegetables. The agent should learn that, for example, when a customer asks for fruits, action 0 should be executed.

import numpy as np

from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts


class CustomEnv(py_environment.PyEnvironment):
    
    def __init__(self):
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=1)
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(1,1), dtype=np.int32, minimum=0, maximum=1)
        self._state = [0]
        self._counter = 0
        self._episode_ended = False
        self.dictionary = {0: ["Fruits"], 
                            1: ["Vegetables"]}
    
    def action_spec(self):
        return self._action_spec
    
    def observation_spec(self):
        return self._observation_spec
    
    def _reset(self):
        self._state = [0]
        self._counter = 0
        self._episode_ended = False
        return ts.restart(np.array([self._state], dtype=np.int32))
    
    def preferences(self):
        return np.random.randint(2)
    
    def pickedBasket(self, yes):
        reward = -1.0
        if yes:
            reward = 0.0
        return reward
    
    def _step(self, action):
        if self._episode_ended:
            # The last step ended the episode, so reset and return the first
            # time step of the new episode.
            return self._reset()
        
        if self._counter<50:
            self._counter += 1
            
            basket = self.preferences()
            condition = basket in self.dictionary[action]
            reward = self.pickedBasket(condition)
            self._state[0] = basket
            
            if self._counter == 50:
                self._episode_ended = True
                return ts.termination(np.array([self._state],
                                               dtype=np.int32),
                                      reward)
            else:
                return ts.transition(np.array([self._state], 
                                              dtype=np.int32), 
                                     reward, 
                                     discount=1.0)

When I execute the following code to check that everything is working fine:

from tf_agents.environments import tf_py_environment

py_env = CustomEnv()
tf_env = tf_py_environment.TFPyEnvironment(py_env)
time_step = tf_env.reset()
action = 0
next_time_step = tf_env.step(action)

I get an unhashable type: 'numpy.ndarray' error for the line condition = basket in self.dictionary[action], so I changed it to condition = basket in self.dictionary[int(action)] and it worked just fine. I'd also like to point out that it worked as a plain Python environment even without the int conversion. So I'd like to ask what changes tf_agents.environments.TFPyEnvironment actually makes. I don't see how it can influence the type of action, since action isn't related to action_spec or anything else (at least not directly in the code).
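
For reference, here is a minimal sketch (outside TF-Agents) of what I suspect is happening, assuming the action reaches _step as a NumPy array rather than a plain Python int:

import numpy as np

dictionary = {0: ["Fruits"], 1: ["Vegetables"]}

action = np.array(0, dtype=np.int32)    # NumPy arrays are not hashable
# dictionary[action]                    # raises TypeError: unhashable type: 'numpy.ndarray'
print(dictionary[int(action)])          # works: int() converts it back to a plain Python int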


Solution

  • Basically, tf_agents.environments.TFPyEnvironment is a translator that sits between your Python environment and the TF-Agents API. On its own, the TF-Agents API does not know how many actions it is allowed to choose from, what data it should observe and learn from, or, especially, how its choice of actions will influence your custom environment.

    Your custom environment provides the rules, and it has to follow certain standards so that TFPyEnvironment can translate it correctly and the TF-Agent can work with it. For that, you need to define certain elements and methods in your custom environment, such as (a short sketch of the resulting translation follows the list below):

    __init__()
      self._action_spec
      self._observation_spec
    _reset()
    _step()
    
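    As a small sketch of the translation mentioned above (using the CustomEnv class from the question), you can compare the specs before and after wrapping: the Python environment exposes NumPy-based ArraySpecs, while the wrapped environment exposes TensorFlow TensorSpecs and returns batched tensors:

        from tf_agents.environments import tf_py_environment

        py_env = CustomEnv()
        tf_env = tf_py_environment.TFPyEnvironment(py_env)

        print(py_env.action_spec())   # BoundedArraySpec with NumPy dtypes
        print(tf_env.action_spec())   # BoundedTensorSpec with TF dtypes
        print(tf_env.reset())         # a TimeStep of batched tf.Tensors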

    I'm not sure whether your doubt comes from the fact that you passed action = 0 yourself and, although that was unrelated to the action_spec, the environment still worked. It is correct that action_spec has no direct relation to your _step() function: the step function simply takes some action and applies it to the environment. How that action is shaped is the real point.

    The point is that you chose the value yourself and handed it to tf_env.step(). If you had instead delegated the choice of action to the agent's policy, for example with action_step = agent.policy.action(time_step) followed by tf_env.step(action_step.action), the policy would have to look at your action_spec definition to understand what the environment expects the action to look like.
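
    As a small sketch of what that delegation can look like (here with a random policy, which only needs the specs to produce valid actions; a trained agent's agent.policy would be used the same way):

        from tf_agents.policies import random_tf_policy

        # The policy reads the specs to know which actions are valid.
        policy = random_tf_policy.RandomTFPolicy(tf_env.time_step_spec(),
                                                 tf_env.action_spec())

        time_step = tf_env.reset()
        action_step = policy.action(time_step)            # picks 0 or 1, as allowed by action_spec
        next_time_step = tf_env.step(action_step.action)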

    If action_spec were not defined, the agent would have no way of knowing that it should choose between 0 for "Fruits" and 1 for "Vegetables" - the actions you wanted and defined - rather than producing unexpected results such as 2 for "Meat", or [3, 2] for two bottles of water if 3 stood for "Bottle of Water". The TF-Agent needs these definitions so that it knows the rules of your environment.
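
    For instance, a hypothetical richer action such as [item_id, quantity] would need its own spec so the agent knows both the shape and the allowed range of the values:

        import numpy as np
        from tf_agents.specs import array_spec

        # Hypothetical two-component action: [item_id, quantity], each between 0 and 3.
        multi_action_spec = array_spec.BoundedArraySpec(
            shape=(2,), dtype=np.int32, minimum=0, maximum=3, name='action')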

    As for the actual changes TFPyEnvironment makes to your custom environment code, I believe you would get a better idea by looking at the source code of the TF-Agents library.