Tags: reinforcement-learning, q-learning

Human trace data for evaluation of a reinforcement learning agent playing Atari?


In recent reinforcement learning research on Atari games, agent performance is often evaluated using "human starts".

In the human-start evaluation, trained agents begin episodes from points randomly sampled from a human professional's game-play.

My question is:
Where can I get this human professional's game-play trace data?
For a fair comparison, the trace data should be the same across studies, but I could not find it anywhere.
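For clarity, here is a rough sketch of what such an evaluation loop might look like. It assumes a hypothetical list of saved emulator states taken from a human play-through (`human_states`), a hypothetical `env.restore_state` method, a classic gym-style 4-tuple `step` interface, and an agent exposing an `act(observation)` method; none of these names come from a specific library or dataset.

```python
import random

def evaluate_human_starts(env, agent, human_states,
                          n_episodes=100, max_steps=108000):
    """Sketch of a human-starts evaluation loop (all APIs hypothetical)."""
    scores = []
    for _ in range(n_episodes):
        # Start the episode from a randomly sampled point of the human trace.
        obs = env.restore_state(random.choice(human_states))  # hypothetical API
        total, done, steps = 0.0, False, 0
        while not done and steps < max_steps:  # illustrative step cap
            # Assumes a gym-style (obs, reward, done, info) step interface.
            obs, reward, done, info = env.step(agent.act(obs))
            total += reward
            steps += 1
        scores.append(total)
    # Mean score over episodes started from human-play states.
    return sum(scores) / len(scores)
```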


Solution

  • I'm not aware of that data being publicly available anywhere. Indeed, as far as I know, all of the papers that use such human-start evaluations were written by the same lab/organization (DeepMind), so it is quite possible that DeepMind has kept the data internal and has not shared it with external researchers.

    Note that the paper Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents proposes a different (arguably better) approach for introducing the desired stochasticity into the environment, which disincentivizes an algorithm from simply memorizing strong sequences of actions. Their approach, referred to as sticky actions, is described in Section 5.2 of that paper (a minimal implementation sketch is included after this answer). In Section 5.3 they also describe numerous disadvantages of other approaches, including disadvantages of the human-starts approach.

    In addition to arguably being the better approach, sticky actions have the advantage that they can very easily be implemented and used by all researchers, allowing for fair comparisons. So, I'd strongly recommend simply using sticky actions instead of human starts. The obvious disadvantage is that you can no longer easily compare your results to those reported in the DeepMind papers that use human starts, but those evaluations have numerous flaws anyway, as described in the paper linked above (human starts can be considered one flaw, but they often have others as well, such as reporting the results of the best run instead of the average over multiple runs, etc.).
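For concreteness, below is a minimal sketch of sticky actions implemented as an environment wrapper. It assumes a gym-style `Wrapper` interface; the class name `StickyActions` is illustrative, while the default repeat probability of 0.25 is the value recommended in the paper.

```python
import random
import gym  # or gymnasium; a compatible Wrapper interface is assumed

class StickyActions(gym.Wrapper):
    """With probability `repeat_prob`, execute the previous action
    instead of the one the agent just chose (sticky actions)."""

    def __init__(self, env, repeat_prob=0.25):
        super().__init__(env)
        self.repeat_prob = repeat_prob
        self.last_action = 0  # action 0 is NOOP in the ALE action set

    def reset(self, **kwargs):
        self.last_action = 0
        return self.env.reset(**kwargs)

    def step(self, action):
        # Repeat the previous action with probability repeat_prob.
        if random.random() < self.repeat_prob:
            action = self.last_action
        self.last_action = action
        return self.env.step(action)
```

Because the stochasticity lives entirely in this thin wrapper, any researcher can reproduce the exact same evaluation protocol without needing access to shared trace data.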