I have a huge list that I need to process, which takes some time, so I divide it into 4 pieces and process each piece in its own process with some function. It still takes a while to run on 4 cores, so I figured I would add a progress bar to the function to show how far each process has gotten through its piece of the list.
My dream was to have something like this:
erasing close atoms, cpu0 [######..............................] 13%
erasing close atoms, cpu1 [#######.............................] 15%
erasing close atoms, cpu2 [######..............................] 13%
erasing close atoms, cpu3 [######..............................] 14%
with each bar moving as the loop in the function progresses. But instead, I get a continuous flow of output filling my terminal window.
Here is the main python script that calls the function:
from eraseCloseAtoms import *
from readPDB import *
import multiprocessing as mp
from vectorCalc import *
prot, cell = readPDB('file')
atoms = vectorCalc(cell)
output = mp.Queue()
# setup mp to erase grid atoms that are too close to the protein (dmin = 2.5A)
cpuNum = 4
tasks = len(atoms)
rangeSet = [tasks / cpuNum for i in range(cpuNum)]
for i in range(tasks % cpuNum):
    rangeSet[i] += 1
rangeSet = np.array(rangeSet)
processes = []
for c in range(cpuNum):
    na, nb = (int(np.sum(rangeSet[:c] + 1)), int(np.sum(rangeSet[:c + 1])))
    processes.append(mp.Process(target=eraseCloseAtoms, args=(prot, atoms[na:nb], cell, 2.7, 2.5, output)))
for p in processes:
    p.start()
results = [output.get() for p in processes]
for p in processes:
    p.join()
atomsNew = results[0] + results[1] + results[2] + results[3]
Below is the function eraseCloseAtoms():
import numpy as np
import click
def eraseCloseAtoms(protein, atoms, cell, spacing=2, dmin=1.4, output=None):
    print 'just need to erase close atoms'
    if dmin > spacing:
        print 'the spacing needs to be larger than dmin'
        return
    grid = [int(cell[0] / spacing), int(cell[1] / spacing), int(cell[2] / spacing)]
    selected = list(atoms)
    with click.progressbar(length=len(atoms), label='erasing close atoms') as bar:
        for i, atom in enumerate(atoms):
            bar.update(i)
            erased = False
            coord = np.array(atom[6])
            for ix in [-1, 0, 1]:
                if erased:
                    break
                for iy in [-1, 0, 1]:
                    if erased:
                        break
                    for iz in [-1, 0, 1]:
                        if erased:
                            break
                        for j in protein:
                            protCoord = np.array(protein[int(j)][6])
                            trueDist = getMinDist(protCoord, coord, cell, vectors)
                            if trueDist <= dmin:
                                selected.remove(atom)
                                erased = True
                                break
    if output is None:
        return selected
    else:
        output.put(selected)
I see two issues in your code.
The first one explains why your progress bars are often showing 100% rather than their real progress. You're calling bar.update(i), which advances the bar's progress by i steps, when I think you want to be updating by one step. A better approach would be to pass the iterable to the progressbar function and let it do the updating automatically:
with click.progressbar(atoms, label='erasing close atoms') as bar:
    for atom in bar:
        erased = False
        coord = np.array(atom[6])
        # ...
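To get a feel for how quickly update(i) saturates a bar, here is a small sketch that uses no click at all. It only mimics the arithmetic: each call adds i to a running position, which is what advancing by i steps amounts to. The function name iterations_until_full is made up for this illustration.

```python
def iterations_until_full(length):
    """Count loop iterations until cumulative update(i) calls reach `length`."""
    position = 0
    for i in range(length):
        position += i  # same effect as bar.update(i): advance by i steps
        if position >= length:
            return i + 1
    return length


# With 10000 atoms the bar is already "full" after 142 iterations,
# because 0 + 1 + ... + 141 = 10011 >= 10000.
print(iterations_until_full(10000))
```

So the bar reaches 100% after roughly sqrt(2 * length) iterations instead of length iterations. Passing the iterable to progressbar, as in the snippet above, avoids this because each iteration advances the bar by exactly one step.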
However, this still won't work with multiple processes iterating at once, each with its own progress bar, due to the second issue with your code. The click.progressbar documentation states the following limitation:
No printing must happen or the progress bar will be unintentionally destroyed.
This means that whenever one of your progress bars updates itself, it will break all of the other active progress bars.
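Here is a rough illustration of why that happens, assuming the bars redraw themselves the usual way: by writing a carriage return and then the new bar text over the old one. The terminal_line helper is hypothetical and only simulates how a terminal handles a carriage return on a single line; it is not part of click.

```python
def terminal_line(written):
    """Return what one terminal line shows after carriage returns are applied."""
    cells = []
    col = 0
    for ch in written:
        if ch == "\r":
            col = 0            # carriage return: cursor back to column 0
            continue
        if col < len(cells):
            cells[col] = ch    # overwrite the character already there
        else:
            cells.append(ch)
        col += 1
    return "".join(cells)


# Two writers taking turns on the same line, as two processes would.
stream = ""
stream += "\rcpu0 [##....] 33%"
stream += "\rcpu1 [#.....] 16%"
stream += "\rcpu0 [###...] 50%"
print(terminal_line(stream))   # only the last writer's bar is visible
```

Each redraw wipes out whatever the other writer put on that line, so at any moment you only see the bar of whichever process wrote last.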
I don't think there is an easy fix for this. It's very hard to interactively update a multiple-line console output (you basically need to be using curses or a similar "console GUI" library with support from your OS). The click module does not have that capability; it can only update the current line. Your best hope would probably be to extend the click.progressbar design to output multiple bars in columns, like:
CPU1: [###### ] 52% CPU2: [### ] 30% CPU3: [######## ] 84%
This would require a non-trivial amount of code to make it work (especially when the updates are coming from multiple processes), but it's not completely impractical.
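As a starting point, here is a minimal sketch of that idea using only the standard library rather than click, written for Python 3. The assumption is that the workers never print: they push ticks onto a multiprocessing.Queue, and the parent process is the only one that draws, rewriting a single line with a carriage return. The names render_line, worker and run_demo are invented for this example, and the sleep call stands in for the real per-atom work.

```python
import multiprocessing as mp
import queue
import sys
import time


def render_line(done, totals, width=10):
    """Format every worker's progress as side-by-side bars on one line."""
    parts = []
    for cpu, (d, t) in enumerate(zip(done, totals)):
        frac = float(d) / t if t else 1.0
        filled = int(width * frac)
        bar = "#" * filled + " " * (width - filled)
        parts.append("CPU%d: [%s] %3d%%" % (cpu, bar, int(100 * frac)))
    return "  ".join(parts)


def worker(cpu, items, progress):
    """Placeholder for the per-chunk work; reports one tick per item."""
    for _ in items:
        time.sleep(0.01)         # stands in for the distance checks
        progress.put((cpu, 1))   # report progress, never print here


def run_demo(chunks):
    progress = mp.Queue()
    procs = [mp.Process(target=worker, args=(cpu, chunk, progress))
             for cpu, chunk in enumerate(chunks)]
    for p in procs:
        p.start()
    totals = [len(c) for c in chunks]
    done = [0] * len(chunks)
    while sum(done) < sum(totals):
        try:
            cpu, n = progress.get(timeout=1.0)
        except queue.Empty:
            if not any(p.is_alive() for p in procs):
                break            # workers exited early; stop waiting
            continue
        done[cpu] += n
        sys.stdout.write("\r" + render_line(done, totals))
        sys.stdout.flush()
    sys.stdout.write("\n")
    for p in procs:
        p.join()
    return done


if __name__ == "__main__":
    run_demo([range(30), range(20), range(25)])
```

In your script the same queue idea would sit alongside the existing output queue: eraseCloseAtoms would take an extra progress queue argument and put a tick per atom, and the main script would run the drawing loop before collecting the results. Sending a tick for every atom may be too chatty for a huge list, so batching ticks (say, one message per 100 atoms) is probably worth it.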