I am doing some computation in Python, and based on the outcomes I have a list of target files that are candidates to be passed to a second program.
For example, I have 50,000 files which contain ~2000 items each. I want to filter for certain items and call a command line program to do some calculation on some of those.
Program #2 can be run from the shell command line, but it also requires a lengthy set of arguments. For performance reasons I would have to run Program #2 on a cluster.
Right now, I am running Program #2 via
subprocess.call("...", shell=True)
But I'd like to run it via qsub in future.
I do not have much experience with how this could be done in a reasonably efficient manner.
Would it make sense to write temporary 'qsub' files and run them via subprocess directly from the Python script? Is there a better, maybe more pythonic solution?
Any ideas and suggestions are very welcome!
It makes perfect sense, although I would go for another solution.
As far as I understand, you have programme #1 that determines which of your 50,000 files need to be processed by programme #2. Both programme #1 and #2 are written in Python. Excellent choice.
Incidentally, I have a Python module that might come in handy: https://gist.github.com/stefanedwards/8841307
If you are running the same qsub-system as I have (no idea what ours is called), you cannot use command arguments on the submitted scripts. Instead, any options are submitted via the -v option, which puts them into environment variables, e.g.:
[me@local ~] $ python isprime.py 1
1: True
[me@local ~] $ head -n 5 isprime.py
#!/usr/bin/python
### This is a python script ...
import os
os.chdir(os.environ.get('PBS_O_WORKDIR','.'))
[me@local ~] $ qsub -v isprime='1 2 3' isprime.py
123456.cluster.control.com
[me@local ~]
Here, isprime.py could handle command line arguments using argparse. Then you just need to check whether the script is running as a submitted job, and if so retrieve said arguments from the environment variables (os.environ).
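A minimal sketch of that check, assuming a PBS/Torque-style scheduler that sets PBS_JOBID inside a running job (an assumption about your cluster; adjust the variable name if yours differs). The function name get_numbers is made up for illustration; the isprime variable matches the qsub -v call above.

```python
import argparse
import os


def get_numbers(argv=None, environ=None):
    """Return the integers to test, from the environment or the command line."""
    environ = os.environ if environ is None else environ
    # Assumption: PBS/Torque sets PBS_JOBID only inside a submitted job.
    if "PBS_JOBID" in environ:
        # Submitted with: qsub -v isprime='1 2 3' isprime.py
        argv = environ.get("isprime", "").split()
    parser = argparse.ArgumentParser(description="Primality check")
    parser.add_argument("numbers", nargs="*", type=int)
    # With argv=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(argv).numbers
```

This way the same script works both interactively (python isprime.py 1) and as a submitted job, and the rest of the code never needs to know where the numbers came from.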
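When programme #2 is modified to be run on the cluster, programme #1 can submit jobs by using subprocess.call(['qsub', '-v', 'options=...', 'programme2.py'], shell=False). Note that '-v' and its value must be separate list items, and that the Python constant is False, not FALSE.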
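A rough sketch of how programme #1 might build and submit such a call. The helper names build_qsub_command and submit are hypothetical, and I am assuming your qsub accepts a comma-separated name=value list after -v and prints the job identifier on stdout, as in the transcript above.

```python
import subprocess


def build_qsub_command(script, variables):
    """Build the argument list for `qsub -v name=value,... script`."""
    for name, value in variables.items():
        # The variable list is comma-separated, so a comma inside a value
        # would be misread; this sketch simply refuses such values.
        if "," in str(value):
            raise ValueError("comma in value for %r" % name)
    variable_list = ",".join(
        "%s=%s" % (name, value) for name, value in variables.items()
    )
    return ["qsub", "-v", variable_list, script]


def submit(script, variables):
    """Submit the job and return whatever qsub prints (the job id)."""
    result = subprocess.run(
        build_qsub_command(script, variables),
        check=True,            # raise CalledProcessError if qsub fails
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()
```

Passing a list without shell=True means the values are never re-parsed by a shell, so file names with spaces survive intact.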
Another approach would be to queue all the files in a database (say, an SQLite database). Then you could have programme #1 check all non-processed entries in the database and determine the outcome (run, not run, run with special options). You now have the opportunity to run programme #2 in parallel on the cluster, where each instance simply checks the database for files to analyse.
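As an illustration of the queue side, here is a small sketch using the standard sqlite3 module. The table name files, its columns, and the function names are all made up for this example.

```python
import sqlite3


def create_queue(conn):
    """Create the (hypothetical) queue table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " options TEXT NOT NULL DEFAULT '',"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    conn.commit()


def enqueue(conn, path, options=""):
    """Register a file for processing; re-adding a known file is a no-op."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO files (path, options) VALUES (?, ?)",
            (path, options),
        )


def pending(conn):
    """Return the paths that have not been picked up yet."""
    rows = conn.execute(
        "SELECT path FROM files WHERE status = 'pending' ORDER BY path"
    )
    return [row[0] for row in rows]
```

Programme #1 would call enqueue for every file that passes its filter. One caveat: SQLite relies on file locking, which can be unreliable on some network filesystems, so on a cluster with shared storage a server-based database may be the safer choice.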
Edit: When Programme #2 is an executable
Instead of a python script, we use a bash script that takes environment variables and puts them on a command line for the programme:
#!/bin/bash
# start in the submission directory when run by PBS, else stay put
cd "${PBS_O_WORKDIR:-.}"
# put options into context/flags etc.
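# quote the variables: an unquoted empty value makes [ -n ] always true
if [ -n "$option1" ]; then _opt1="--opt1 $option1"; fi
# we can even define our own defaults
_opt2='--no-verbose'
if [ -n "$opt2" ]; then _opt2="-o $opt2"; fi
# left unquoted on purpose so each flag and its value become separate words
/path/to/exe $_opt1 $_opt2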
If you are going for the database solution, then have a python script that checks the database for unprocessed files, marks a file as being processed (do these two in a single transaction), gets the options, calls the executable with subprocess, marks the file as done when finished, checks for a new file, etc.
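A sketch of such a worker loop, under the assumption of a hypothetical files table with path, options, and status columns. All function names are made up. The connection is opened in autocommit mode so that the claim step can control its own transaction with BEGIN IMMEDIATE, which takes SQLite's write lock before reading.

```python
import shlex
import sqlite3
import subprocess


def open_queue(database):
    """Open the queue in autocommit mode and make sure the table exists."""
    conn = sqlite3.connect(database, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " path TEXT PRIMARY KEY,"
        " options TEXT NOT NULL DEFAULT '',"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )
    return conn


def claim_next(conn):
    """Pick one pending file and mark it 'processing' in one transaction."""
    conn.execute("BEGIN IMMEDIATE")  # lock out other writers first
    try:
        row = conn.execute(
            "SELECT path, options FROM files"
            " WHERE status = 'pending' ORDER BY path LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE files SET status = 'processing' WHERE path = ?",
                (row[0],),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return row  # (path, options), or None when nothing is left


def work(conn, executable):
    """Process files until the queue is empty."""
    while True:
        job = claim_next(conn)
        if job is None:
            break
        path, options = job
        # options is stored as one string; split it like a shell would
        returncode = subprocess.call([executable, *shlex.split(options), path])
        status = "done" if returncode == 0 else "failed"
        conn.execute(
            "UPDATE files SET status = ? WHERE path = ?", (status, path)
        )
```

Each cluster job would then just call work(open_queue(...), '/path/to/exe'). Because selecting and marking happen inside the same transaction, two workers should not end up with the same file.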