Tags: python, matlab, machine-learning, neural-network, reinforcement-learning

Karpathy's code training neural net to play Pong using Policy Gradients


I'm looking at Andrej Karpathy's "Training a Neural Network ATARI Pong agent with Policy Gradients from raw pixels" https://gist.github.com/karpathy/a4166c7fe253700972fcbc77e4ea32c5 . I'm not a Python person so I'm trying to write this code in Matlab. I have 2 questions.

Question 1: I noticed that xs, hs, dlogps, and drs are initialized to [],[],[],[] (line 67) and reset to [],[],[],[] after each episode (line 103). But epx, eph, epdlogp, and epr are neither initialized nor reset. They seem to keep growing forever (lines 99-102). Am I correct? I'm not familiar with the nuances of np.vstack.

Question 2: If I had a game with player movement options up, down, right, and left, how would I need to modify this code to make it work (besides the obvious modification to 4 nodes in the output layer)?

Thanks.


Solution

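  • I think you're imagining nuances of numpy.vstack that it doesn't have. np.vstack returns a new array built only from the list it is given; it does not append to whatever is on the left-hand side. Lines 99-102 of the code you linked to assign that result to the variables concerned, so any previous values of these variables are replaced. Since xs, hs, dlogps, and drs are reset after each episode, epx, eph, epdlogp, and epr only ever hold the current episode's data and do not keep growing.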

    epx = np.vstack(xs)
    eph = np.vstack(hs)
    epdlogp = np.vstack(dlogps)
    epr = np.vstack(drs)
    

    For the second part of your question, I think you need to try something out and, if it doesn't work, ask a new question showing what you've tried.
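
To see the vstack point for yourself, here is a minimal sketch that mimics two episodes. It is not taken from the gist: the helper name stack_episode and the row values are made up for illustration, and only the xs/epx pair is shown, on the assumption that hs, dlogps, and drs behave the same way.

```python
import numpy as np

def stack_episode(rows):
    """Stack a list of per-step row vectors into one 2-D array (hypothetical helper)."""
    return np.vstack(rows)

# Episode 1: three time steps, each a length-2 row (made-up values).
xs = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
epx = stack_episode(xs)
first_shape = epx.shape

# Reset the per-step list, as the gist does after each episode.
xs = []

# Episode 2: only two time steps.
xs.append(np.array([7.0, 8.0]))
xs.append(np.array([9.0, 10.0]))
epx = stack_episode(xs)  # rebinds epx; the episode-1 array is not extended
second_shape = epx.shape

print(first_shape, second_shape)  # prints (3, 2) (2, 2)
```

If epx accumulated across episodes, the second shape would be (5, 2) rather than (2, 2). For a Matlab port, the same idea should apply: build each episode's matrix from scratch out of that episode's rows instead of appending to the previous one.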