I recently came across OpenAI Five. I was curious to see how their model is built and to understand it. I read on Wikipedia that it "contains a single layer with a 1024-unit LSTM". Then I found this pdf containing a diagram of the architecture.
From all this I don't understand a few things:
What does it mean to have a 1024-unit LSTM layer? Does this mean we have 1024 time steps with a single LSTM cell, or does it mean we have 1024 cells? Could you show me some kind of graph visualizing this? I'm especially having a hard time visualizing 1024 cells in one layer. (I tried looking at several SO questions such as 1, 2, or the OpenAI Five blog, but they didn't help much.)
How can you do reinforcement learning on such a model? I'm used to RL being done with Q-tables that are updated during training. Does this simply mean that their loss function is the reward?
How come such a large model doesn't suffer from vanishing gradients or the like? I haven't seen any kind of normalization in the pdf.
In the pdf you can see a blue rectangle; it seems to represent a unit, and there are N of those. What does this mean? And please correct me if I'm mistaken: the pink boxes are used to select the best move/item(?)
In general, all of this can be summarized as "how does the OpenAI Five model work?"
It means that the size of the hidden state is 1024 units; in effect, the LSTM has 1024 cells at each timestep. We do not know in advance how many timesteps there will be.
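A minimal NumPy sketch can make the distinction concrete. The hidden size (1024) fixes the length of the state vectors `h` and `c`; the number of timesteps is just how many times you apply the step function. (The input size of 512 and the weight initialization here are made up for illustration; they are not OpenAI Five's actual values.)

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep. The four gates are stacked in z: input, forget, candidate, output."""
    H = h.shape[0]
    z = W @ x + U @ h + b                 # shape (4H,)
    i = 1.0 / (1.0 + np.exp(-z[:H]))      # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))   # forget gate
    g = np.tanh(z[2*H:3*H])               # candidate cell state
    o = 1.0 / (1.0 + np.exp(-z[3*H:]))    # output gate
    c_new = f * c + i * g                 # additive cell-state update
    h_new = o * np.tanh(c_new)
    return h_new, c_new

hidden, inputs = 1024, 512                # hidden size from the paper; input size is hypothetical
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * hidden, inputs)) * 0.01
U = rng.normal(size=(4 * hidden, hidden)) * 0.01
b = np.zeros(4 * hidden)

h = np.zeros(hidden)
c = np.zeros(hidden)
for t in range(7):                        # any number of timesteps works
    x = rng.normal(size=inputs)
    h, c = lstm_step(x, h, c, W, U, b)

print(h.shape)                            # (1024,): "1024 units" is the size of h, not a step count
```

So "1024-unit" describes the width of the single layer, and the same 1024-dimensional state is carried forward however many timesteps the episode lasts.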
The state of the LSTM (hidden state) represents the current state as observed by the agent. It gets updated every timestep using the input received. This hidden state can be used to predict the Q-function (as in Deep Q-learning). You don't have an explicit table mapping (state, action) -> q_value; instead you have a 1024-dimensional vector representing the state, which feeds into another dense layer that outputs the Q-values for all possible actions.
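As a sketch of that last step: a single dense layer maps the 1024-dimensional state vector to one Q-value per action, and the agent can act greedily on those. (The action count of 8 and the random weights are hypothetical placeholders, not anything from OpenAI Five.)

```python
import numpy as np

rng = np.random.default_rng(1)
n_actions = 8                        # hypothetical number of discrete actions
h = rng.normal(size=1024)            # LSTM hidden state at the current timestep

# Dense "head": weight matrix mapping the state vector to one value per action.
W_q = rng.normal(size=(n_actions, 1024)) * 0.01
b_q = np.zeros(n_actions)
q_values = W_q @ h + b_q             # shape (n_actions,), one Q-value per action

action = int(np.argmax(q_values))    # greedy action selection
print(q_values.shape, action)
```

During training, the loss is not the reward itself; in Deep Q-learning it is typically a squared error between the predicted Q-value and a bootstrapped target built from the reward.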
LSTMs are themselves a mechanism for mitigating vanishing gradients: the long-range memory (the cell state) gives gradients an easier path to flow backwards through time.
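The standard LSTM cell equations make this concrete. The cell state $c_t$ is updated additively, so the gradient flowing back through $c_t$ passes through the elementwise product with $f_t$ rather than being repeatedly squashed through nonlinearities:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
$$

When the forget gate $f_t$ stays close to 1, $\partial c_t / \partial c_{t-1} \approx 1$ along that path, so gradients can survive over many timesteps.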
If you are referring to the big blue and pink boxes, the pink ones seem to be input values that are put through a network and pooled over each pickup or modifier. The blue box seems to do the same thing over each unit. The terms pickup, modifier, unit, etc. are meaningful in the context of the game being played (Dota 2).
Here is an image of the LSTM unrolled over time; the yellow nodes at each step are the n hidden units:
The vector h is the hidden state of the LSTM; it is passed to the next timestep and also used as the output at that timestep.