multiple-instances snakemake

How to write Snakefile code that runs only in the first of possibly multiple snakemake sub-instances


I want to be able to write code in my Snakefile that will only be executed upon the initial invocation of the Snakefile, and will not be executed if snakemake reruns the Snakefile as a sub-instance because I specified the -j option to use multiple cores. How can I do this?

I am not talking about workflow code, but about Python code in the Snakefile that performs various preparatory tasks before the workflow rules are defined.

I have several places where I want to do this. In some of them there is no need to do the work more than once, and I want to speed up the Snakefile by doing it only in the initial invocation. For example, one part of my Snakefile code checks whether certain pipeline include files (NOT input and output files of the actual pipeline) have been edited by the user and, if so, backs them up. I don't want every sub-instance scanning the dates on all these files and making backups. In fact, race conditions exist where multiple instances try to back up the same file.


Solution

  • I found a way to do it.

    # Create Boolean variable isFirstInstance, True if this is the first snakemake
    # instance of a run of snakemake, False if it is a nested sub-instance.
    #
    # This determines whether or not this is the first snakemake instance by creating
    # a unique file with each initial run of the snakefile, whose name is created
    # much as tempFile() creates files, but we don't use tempFile() because we don't
    # want to delete this file when any instance exits, only when the first instance
    # exits.  The file name includes the process group ID, which will be the same
    # for the first instance and for sub-instances.  The file contains one line, the
    # process ID of its creator.  If the file doesn't exist, it is created and we
    # set the variable isFirstInstance True to indicate that this is the first
    # instance of the pipeline.  If the file exists and the process ID it contains
    # matches the process ID of one of the parents of the current process, then the
    # current process is not the first instance of this pipeline invocation, and
    # so we set isFirstInstance False.  Two other aberrant situations can arise.
    # First, if the file exists and its contained process ID matches the process ID
    # of THIS process, we presume that the file was for some reason not deleted from
    # a previous run, and that run happened to have a process group ID and process
    # ID matching the current one, and so we assume we are first instance, and we
    # delete the file and recreate it so its date matches the current date.  Second,
    # if the file exists and DOES NOT contain the process ID of one of our
    # ancestors, we make the same presumption of an undeleted old file, and again
    # delete the file, then rewrite it with our process ID.
    ################################################################################
    
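    import os
    import psutil  # third-party package, used below to walk up the parent processes

    # Create file name containing our process group ID in the name.
    # TMP_DIR is assumed to be defined earlier in the Snakefile.
    initialInstancePIDfile = os.path.join(TMP_DIR, "initialInstancePID." + str(os.getpgrp()) + ".tmp")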
    
    # If file doesn't exist, this is first instance.  Create the file.
    myPID = str(os.getpid())
    if not os.path.exists(initialInstancePIDfile):
        f = open(initialInstancePIDfile, "wt")
        f.write(myPID)
        f.close()
        isFirstInstance = True
        #print("Instance file does not exist, created it:", initialInstancePIDfile, "and myPID =", myPID)
    else:
        # Otherwise, read the process ID from the file and see if it matches ours.
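        # strip() drops any trailing newline, and an empty file yields "" instead
        # of raising IndexError, so it is then handled as a leftover file below.
        with open(initialInstancePIDfile, "rt") as f:
            fPID = f.readline().strip()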
        if fPID == myPID:
            f = open(initialInstancePIDfile, "wt")
            f.write(myPID)
            f.close()
            isFirstInstance = True
            print("Instance file existed already, with our PID: ", myPID, " so we presumed it was a leftover and deleted and recreated it.")
        else:
            isFirstInstance = None
            # It doesn't match ours, does it match one of our parents?
            try:
                lastPID = None
                parentPID = myPID
                while parentPID != lastPID:
                    lastPID = parentPID
                    parentPID = str(psutil.Process(int(lastPID)).ppid())
                    #print("Parent ID is:", parentPID)
                    if parentPID == fPID:
                        isFirstInstance = False
                        #print("Instance file contains the PID of one of our parents:", fPID, initialInstancePIDfile, "and myPID =", myPID)
                        break
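            except (psutil.Error, ValueError):
                # Could not inspect an ancestor process (for example it no longer
                # exists or access was denied); fall through to the leftover case.
                pass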
            # If it doesn't match a parent either, it is a leftover file from a
            # previous invocation.  Replace it with a new file.
            if isFirstInstance is None:
                f = open(initialInstancePIDfile, "wt")
                f.write(myPID)
                f.close()
                isFirstInstance = True
                print("Instance file existed already, with a PID:", fPID, "not matching ours:", myPID,
                    "or a parent, so we presumed it was a leftover and deleted and recreated it.")
    if isFirstInstance:
        print("Initial pipeline instance running.")
    else:
        print("Pipeline sub-instance running.")
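
The code above checks os.path.exists() and then opens the file as two separate steps, so two processes starting at nearly the same moment could both conclude that the file is missing. One possible way to narrow that window, sketched here under assumptions rather than taken from the answer, is to let the operating system do the check and the creation in one atomic call with os.open() and the O_CREAT | O_EXCL flags. The function name claim_first_instance, the variable names, and the temporary demo directory are hypothetical and exist only for illustration. Leftover files from earlier runs would still need the PID comparison described above.

```python
import os
import tempfile


def claim_first_instance(marker_path):
    """Return True if this call created marker_path, False if it already existed.

    Hypothetical helper: O_CREAT | O_EXCL makes "check and create" one atomic
    step, so only one caller can ever succeed for a given path.
    """
    try:
        fd = os.open(marker_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        # Someone else (or a stale earlier run) already created the marker.
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))  # record the creator's PID, as in the answer
    return True


# Demo in a throwaway directory (stands in for TMP_DIR, which is an assumption here).
demo_dir = tempfile.mkdtemp()
marker = os.path.join(demo_dir, "initialInstancePID.%d.tmp" % os.getpgrp())

first_call = claim_first_instance(marker)   # creates the marker file
second_call = claim_first_instance(marker)  # finds it already present

if first_call:
    # One-time preparation (for example, backing up edited include files)
    # would go here.
    print("Initial pipeline instance running.")
print(first_call, second_call)
```

In a Snakefile, the returned value could play the role of isFirstInstance for the case where the file does not yet exist, with the existing PID and ancestor checks applied only when the call returns False.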