
Using gitpython to get current hash does not work when using qsub for job submission on a cluster


I use Python for my data analysis, and I recently had the idea of saving the current git hash in a log file so I can later check which code version produced my results (for example, if I find inconsistencies).

It works fine as long as I do it locally.

import git
import os
rep = git.Repo(os.getcwd(), search_parent_directories=True)
git_hash = rep.head.object.hexsha
with open('logfile.txt', 'w+') as writer:
    writer.write('Code version: {}'.format(git_hash))

However, I have a lot of heavy calculations that I run on a cluster to speed things up (running the analyses of different subjects in parallel), using qsub, which looks more or less like this:

qsub -l nodes=1:ppn=12 analysis.py -q shared

This always results in a git.exc.InvalidGitRepositoryError.

EDIT

Printing os.getcwd() showed me that on the cluster the current working directory is always my $HOME directory, no matter where I submit the job from. My next attempt was to get the directory where the file is located, using some of the solutions suggested here.

However, these solutions result in the same error because (as far as I understand it) my file is copied to a directory deep in the root structure of the cluster's headnode (/var/spool/torque/mom_priv/jobs).
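
The behaviour described above can be checked with a short diagnostic. This is only a sketch: the function name describe_locations is made up for illustration, and it simply reports the two directories a script would normally use to find its repository.

```python
import os
import sys


def describe_locations(script_path):
    """Return the two directories a script usually uses to locate its repository.

    describe_locations is a hypothetical helper name, not part of any library.
    """
    return {
        # Directory the process was started in ($HOME for the jobs described above).
        "cwd": os.getcwd(),
        # Directory containing the script file that is actually being executed
        # (the spool copy, if the scheduler copied the script before running it).
        "script_dir": os.path.dirname(os.path.abspath(script_path)),
    }


if __name__ == "__main__":
    for name, path in describe_locations(sys.argv[0]).items():
        print("{}: {}".format(name, path))
```

Running this once locally and once through qsub should make it visible that neither directory points into the repository when the script runs as a job.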

I could of course write down the location of my file as a hardcoded variable, but I would like a general solution for all my scripts.


Solution

  • So after I explained my problem to IT in detail, they could help me solve the problem.

    Apparently the $PBS_O_WORKDIR variable stores the directory from which the job was submitted.

    So I changed how I retrieve the git hash as follows:

    import os

    import git

    # On the cluster, PBS_O_WORKDIR holds the directory the job was submitted from;
    # locally it is not set, so fall back to the current working directory.
    try:
        script_file_directory = os.environ["PBS_O_WORKDIR"]
    except KeyError:
        script_file_directory = os.getcwd()

    try:
        rep = git.Repo(script_file_directory, search_parent_directories=True)
        git_hash = rep.head.object.hexsha
    except git.InvalidGitRepositoryError:
        git_hash = 'not-found'

    # create a log file that saves some information about the run script
    with open('logfile.txt', 'w+') as writer:
        writer.write('Codeversion: {} \n'.format(git_hash))
    

    I first check whether the PBS_O_WORKDIR variable exists (i.e. whether the script is running as a job on the cluster). If it does, I get the git hash from that directory; if it doesn't, I use the current working directory.

    Very specific, but maybe one day someone has the same problem...
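
Since the question asked for something reusable across scripts, the same idea could be wrapped in two small helpers. This is a minimal sketch, not a definitive implementation: the names resolve_repo_search_dir and current_git_hash are made up for illustration, and catching git.NoSuchPathError is an extra assumption (it covers the case where the submission directory is not visible from the compute node).

```python
import os


def resolve_repo_search_dir(environ=None, fallback=None):
    """Pick the directory from which to start searching for the git repository.

    Prefers PBS_O_WORKDIR (the directory the job was submitted from) and
    falls back to the current working directory for local runs.
    Hypothetical helper name, not part of GitPython.
    """
    environ = os.environ if environ is None else environ
    fallback = os.getcwd() if fallback is None else fallback
    return environ.get("PBS_O_WORKDIR", fallback)


def current_git_hash(start_dir):
    """Return the commit hash of HEAD, or 'not-found' if no repository is found."""
    import git  # GitPython, imported lazily so the resolver works without it

    try:
        repo = git.Repo(start_dir, search_parent_directories=True)
        return repo.head.object.hexsha
    except (git.InvalidGitRepositoryError, git.NoSuchPathError):
        # Assumption: a missing directory is treated the same as "no repository".
        return "not-found"
```

In a script this would be used as git_hash = current_git_hash(resolve_repo_search_dir()). Passing the environment in as an argument keeps the directory lookup easy to test without touching the real environment.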