Tags: python, subprocess, tensorboard, slurm

How can I start Tensorboard dev from within Python as a subprocess parallel to the Python session?


I want to monitor the training progress of a CNN that is trained via a Slurm job on a server. The Python script is executed through a bash script whenever the server has resources available, so the session is not interactive and I cannot simply open a terminal and run Tensorboard dev.

So far, I have tried the following without finding a new experiment on my Tensorboard dev site:

mod = "SomeModelType"
logdir = "/some/directory/used/in/Tensorboard/callback"
PARAMETERS = "Some line of text describing the training settings"

subprocess.Popen(["tensorboard", "dev upload --logdir '" + logdir + \
                  "' --name Myname_" + mod + " --description '" + \
                      PARAMETERS + "'"])

If I paste the command "tensorboard dev upload --logdir 'some/directory..." into a terminal, Tensorboard starts as expected. If I include the code shown above in my script, no new Tensorboard experiment is started.
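A likely reason for the difference, sketched below under the assumption of a POSIX shell: in a terminal the shell splits the command line into separate arguments and strips the quotes, whereas a list passed to Popen is handed over element by element, so everything after "tensorboard" arrives as one single argument. The snippet uses shlex.split only to imitate the shell's tokenization; it does not start Tensorboard.

```python
import shlex

mod = "SomeModelType"
logdir = "/some/directory/used/in/Tensorboard/callback"
PARAMETERS = "Some line of text describing the training settings"

# What the Popen call above passes: the program name plus ONE long argument.
as_written = ["tensorboard", "dev upload --logdir '" + logdir +
              "' --name Myname_" + mod + " --description '" +
              PARAMETERS + "'"]

# What a POSIX shell would pass for the same text: one argument per token,
# with the single quotes consumed by the shell.
as_shell = shlex.split(" ".join(as_written))

print(len(as_written), "arguments as written")
print(len(as_shell), "arguments after shell-style splitting:")
for arg in as_shell:
    print("   ", repr(arg))
```

If this is indeed the cause, tensorboard receives a subcommand it does not recognize, which would explain why no experiment appears.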

I also tried this:

subprocess.run(["/pfs/data5/home/kit/ifgg/mp3890/.local/bin/tensorboard", \
                "dev", "upload", "--logdir", "'" + logdir + \
                "'", "--name", "LeleNet" + mod#, "--description" + "'" + \
                    #PARAMETERS + "'"
                    ], \
               capture_output = False, text = False)

which starts Tensorboard, but the Python script does not continue. Hence, Tensorboard is listening for output that never comes, because the Python session is waiting for Tensorboard to exit instead of training the CNN.
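The blocking behaviour is expected: subprocess.run waits until the child process exits, while subprocess.Popen returns as soon as the child has been started. A minimal sketch that shows the difference, using a one-second sleep in a child Python process as a stand-in for the uploader (the stand-in is an assumption for illustration only):

```python
import subprocess
import sys
import time

# Stand-in for a long-running child process (here it just sleeps 1 second).
child = [sys.executable, "-c", "import time; time.sleep(1)"]

# Popen returns immediately; the child keeps running in parallel.
start = time.monotonic()
proc = subprocess.Popen(child)
popen_elapsed = time.monotonic() - start
proc.wait()  # only here do we wait for the child to finish

# run() does not return before the child has exited.
start = time.monotonic()
subprocess.run(child)
run_elapsed = time.monotonic() - start

print(f"Popen returned after {popen_elapsed:.3f} s")
print(f"run returned after   {run_elapsed:.3f} s")
```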

Edit: This:

subprocess.Popen(["/pfs/data5/home/kit/ifgg/mp3890/.local/bin/tensorboard", \
                "dev", "upload", "--logdir", "'" + logdir + \
                "'", "--name", "LeleNet" + mod#, "--description" + "'" + \
                    #PARAMETERS + "'"
                    ])

led to the message "Listening for new data in the log dir..." being printed repeatedly in interactive mode and led to cancellation of the Slurm job (the job disappeared). Moreover, Tensorboard does not work correctly this way: the experiment is created, but it never receives any data.
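One possible cause for the empty experiment (an assumption, not something confirmed here): when arguments are passed as a list, no shell is involved, so the single quotes wrapped around logdir are not stripped and become part of the path. The sketch below makes this visible by letting a child Python process echo back the argument it received; the child is a stand-in, not Tensorboard.

```python
import subprocess
import sys

logdir = "/some/directory"

# Child process that simply prints the first argument it receives.
echo_arg = [sys.executable, "-c", "import sys; print(sys.argv[1])"]

# Quoted as in the Popen call above: the quotes reach the child literally.
quoted = subprocess.run(echo_arg + ["'" + logdir + "'"],
                        capture_output=True, text=True).stdout.strip()

# Passed without extra quotes: the child sees the real path.
plain = subprocess.run(echo_arg + [logdir],
                       capture_output=True, text=True).stdout.strip()

print("with quotes:   ", quoted)
print("without quotes:", plain)
```

If that is what happens, Tensorboard would watch a directory whose name starts and ends with a quote character, which does not exist, and so it would never find any event files.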


Solution

  • I got it to work as follows:

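    import subprocess

    logdir = "/some/directory"
    tbn = "some_name"
    DESCRIPTION = "some description of the experiment"

    # shell=True hands the whole string to the shell; the trailing "&" runs
    # the uploader in the background so the Python script continues.
    subprocess.call("tensorboard dev upload --logdir '" + logdir +
                    "' --name " + tbn + " --description '" +
                    DESCRIPTION + "' &", shell=True)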
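A shell-free variant might also work and avoids quoting problems when the description contains quotes or other special characters. This is an untested sketch, not the method used above: the helper names build_upload_command and start_uploader and the log file name are hypothetical, and only the flags already shown in this thread are used. Output is redirected to a file so the repeated status messages do not end up in the job's output.

```python
import subprocess


def build_upload_command(logdir, name, description):
    """Return the argument list; each value is passed as-is, without quotes."""
    return ["tensorboard", "dev", "upload",
            "--logdir", logdir,
            "--name", name,
            "--description", description]


def start_uploader(logdir, name, description,
                   log_path="tensorboard_upload.log"):  # hypothetical file name
    """Start the uploader in the background and return the Popen handle."""
    log_file = open(log_path, "w")
    # Popen returns immediately, so training can continue in this process.
    return subprocess.Popen(build_upload_command(logdir, name, description),
                            stdout=log_file, stderr=subprocess.STDOUT)


if __name__ == "__main__":
    print(build_upload_command("/some/directory", "some_name",
                               "some description of the experiment"))
```

In a training script one would call uploader = start_uploader(logdir, tbn, DESCRIPTION) before training and uploader.terminate() once training has finished.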