Tags: ubuntu, tensorflow, cluster-computing, distributed-computing

Ubuntu: create a TensorFlow worker node


I am using TensorFlow with Python under Ubuntu.

I read here about how to start working with a TensorFlow cluster. I wish to set up a few more machines to run TF and form a working cluster, but I can't find any straightforward examples of how to configure machines as TF worker nodes.

Should I set TF up on standalone machines and then bind them all into a cluster? Or should I set up a cluster first (if so, please point to an example) and then install TF on it as a cluster?

EDIT: The answers are good and helpful. I am trying to understand how the TF cluster concept interacts with the Beowulf cluster concept, and whether I need a Beowulf cluster at all here.

Thanks


Solution

  • I think you missed the content at the bottom of the page on how to run TensorFlow processes as parameter servers or workers. The example below starts two parameter servers and two workers: `--job_name` says whether a process is a parameter server or a worker, and `--task_index` gives that machine's index within its group:

    # On ps0.example.com:
    $ python trainer.py \
         --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
         --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
         --job_name=ps --task_index=0
    # On ps1.example.com:
    $ python trainer.py \
         --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
         --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
         --job_name=ps --task_index=1
    # On worker0.example.com:
    $ python trainer.py \
         --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
         --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
         --job_name=worker --task_index=0
    # On worker1.example.com:
    $ python trainer.py \
         --ps_hosts=ps0.example.com:2222,ps1.example.com:2222 \
         --worker_hosts=worker0.example.com:2222,worker1.example.com:2222 \
         --job_name=worker --task_index=1
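For reference, here is a minimal sketch of what the `trainer.py` invoked above might look like, based on the classic distributed TensorFlow (1.x) tutorial. Every process receives the identical cluster description; only `--job_name` and `--task_index` differ per machine. The flag names match the commands above, but the script body is an illustrative assumption, not the asker's actual code, and it assumes TensorFlow 1.x is installed.

```python
# trainer.py -- hypothetical skeleton for the commands shown above.
import argparse


def parse_hosts(csv):
    """Split a comma-separated host:port string into a list."""
    return [h.strip() for h in csv.split(",") if h.strip()]


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--ps_hosts", required=True,
                        help="comma-separated parameter-server host:port pairs")
    parser.add_argument("--worker_hosts", required=True,
                        help="comma-separated worker host:port pairs")
    parser.add_argument("--job_name", choices=["ps", "worker"], required=True)
    parser.add_argument("--task_index", type=int, default=0)
    args = parser.parse_args()

    # Imported here so the flag-parsing logic above has no TF dependency.
    import tensorflow as tf  # assumes TensorFlow 1.x

    # Identical on every machine: the full map of the cluster.
    cluster = tf.train.ClusterSpec({
        "ps": parse_hosts(args.ps_hosts),
        "worker": parse_hosts(args.worker_hosts),
    })

    # job_name/task_index tell this process which entry it is.
    server = tf.train.Server(cluster,
                             job_name=args.job_name,
                             task_index=args.task_index)

    if args.job_name == "ps":
        # Parameter servers just serve variables until killed.
        server.join()
    else:
        # A real worker would build its model graph here, typically
        # under tf.device(tf.train.replica_device_setter(cluster=cluster)),
        # and then run the training loop against server.target.
        server.join()


if __name__ == "__main__":
    main()
```

Note that each process is just a plain OS process started by hand (or by SSH) on a stock Ubuntu machine: TF's gRPC-based cluster needs no Beowulf-style middleware, only network reachability between the listed host:port pairs.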