What exact role do parameter servers and workers have during distributed training of neural networks (e.g. in Distributed TensorFlow)?
Perhaps it helps to break it down as follows. For example:
Parameter Servers — A parameter server is actually much the same as a worker. Typically it runs on a CPU and is where you store the variables the workers need. In my case, this is where I defined the weight variables needed for my networks.
Workers — This is where we do most of the computationally intensive work. In the forward pass, the workers take the variables from the parameter servers and do something with them. In the backward pass, the workers send the current state back to the parameter servers, which perform an update operation and hand back the new weights to try out (a minimal placement sketch follows this list).
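To make that division of labour concrete, here is a minimal sketch using the legacy tf.train distributed API (tf.compat.v1 in TF 2). The host names, ports, and tensor shapes are made-up placeholders; the only point is that the variables are pinned to the ps job while the compute ops are pinned to a worker.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

# Hypothetical cluster: one parameter server and one worker.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222"],
})

# Variables (the model weights) live on the parameter server ...
with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", shape=[784, 10])
    b = tf.get_variable("b", shape=[10])

# ... while the compute-heavy ops run on the worker, which pulls w and b
# from the parameter server during the forward pass.
with tf.device("/job:worker/task:0"):
    x = tf.placeholder(tf.float32, shape=[None, 784])
    logits = tf.matmul(x, w) + b
```

In an actual job, each process would additionally start a tf.train.Server with its own job_name and task_index; the ps processes typically just call server.join() while the workers build and run the training graph.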
Are parameter servers only responsible for storing and providing variable values in an ACID store? ==> Yes, as per the TensorFlow documentation and the Medium article.
Do different parameter servers manage different variables in the graph? ==> Yes, inferred from the following statement from this link:

"In addition to that, you can decide to have more than one parameter server for efficiency reasons. Using parameter servers can provide better network utilization, and it allows scaling models to more parallel machines. It is possible to allocate more than one parameter server."
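A short sketch of that point, again assuming the legacy tf.train API (tf.compat.v1 in TF 2): tf.train.replica_device_setter places each variable on one of the ps tasks (round-robin by default), so with two parameter servers different variables end up on different ps tasks. The host names and variable shapes below are invented for illustration.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0:2222", "ps1:2222"],   # two parameter servers
    "worker": ["worker0:2222"],
})

# replica_device_setter pins variables to the ps tasks and everything else
# to the given worker device.
with tf.device(tf.train.replica_device_setter(
        cluster=cluster, worker_device="/job:worker/task:0")):
    w1 = tf.get_variable("w1", shape=[784, 256])  # e.g. assigned to /job:ps/task:0
    w2 = tf.get_variable("w2", shape=[256, 10])   # e.g. assigned to /job:ps/task:1

print(w1.device, w2.device)  # inspect which ps task each variable landed on
```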
Do parameter servers receive the gradients themselves (and thus add them up)? ==> No. AFAIK, they receive the updated weights, because computing the gradients and modifying the weights using the formula

W1 = W0 - learning_rate * gradients

happens in the workers.
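Below is a minimal sketch of the flow described in this answer, using the legacy tf.train API (tf.compat.v1 in TF 2): the worker computes the gradients and the new weights via W1 = W0 - learning_rate * gradients, and an assign op writes the updated weights back to the variable hosted on the parameter server. The loss, initial values, and learning rate are placeholders, and where the update op actually executes in a real job depends on device placement, so treat this purely as an illustration of the formula above.

```python
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

learning_rate = 0.1  # illustrative value

# W0 lives on the parameter server.
with tf.device("/job:ps/task:0"):
    w = tf.get_variable("w", initializer=tf.constant([0.5, -0.3]))

# The worker runs the backward pass and computes the new weights.
with tf.device("/job:worker/task:0"):
    loss = tf.reduce_sum(tf.square(w))     # stand-in loss
    grads = tf.gradients(loss, [w])[0]     # gradients computed on the worker
    new_w = w - learning_rate * grads      # W1 = W0 - learning_rate * gradients
    send_back = tf.assign(w, new_w)        # updated weights written back to the ps
```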