Tags: hpc, snakemake, globus-toolkit

How to make Snakemake recognize Globus remote files using Globus CLI?


I am working in a high-performance computing grid environment, where large-scale data transfers are done via Globus. I would like to use Snakemake to pull data from a Globus path, process the data, and then push the processed data to a different Globus path. Globus has a command-line interface.

Pulling the data is no problem, since I'd just create a rule that runs globus transfer to create the requisite local file. But for pushing the data back to Globus, I think I'll need a rule that can "see" that the file is missing at the remote location, and then work backwards to determine what needs to happen to create the file.

I could create local "proxy" files that represent the remote files. For example, I could make a rule that creates 'processed_data_1234.tar.gz' output files in a directory. These files would just be created using touch (and thus be empty), and the same rule would run globus transfer to push the real files to the remote location. But then there's the overhead of making sure that the proxy files don't get out of sync with the real Globus-hosted files.
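To make the proxy-file idea concrete, here is a minimal Python sketch of what such a rule could call. The function names, the endpoint placeholders, and the marker path are hypothetical, not part of Snakemake or Globus. Note that globus transfer only submits an asynchronous task, so a real rule would probably also need to wait for the task to finish (for example with globus task wait) before touching the proxy file.

```python
import subprocess
from pathlib import Path


def build_transfer_cmd(src_endpoint, src_path, dst_endpoint, dst_path):
    """Return the argv for `globus transfer SRC_EP:PATH DST_EP:PATH`."""
    return [
        "globus", "transfer",
        f"{src_endpoint}:{src_path}",
        f"{dst_endpoint}:{dst_path}",
    ]


def push_and_mark(cmd, marker, run=subprocess.run):
    """Submit the transfer, then create an empty local proxy file.

    `run` is injectable so the function can be exercised without the
    Globus CLI installed. This sketch does not wait for the transfer
    task to complete; it only checks that submission succeeded.
    """
    run(cmd, check=True)  # raises CalledProcessError if submission fails
    marker = Path(marker)
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return marker
```

A rule could then call push_and_mark(build_transfer_cmd(...), output[0]) in a run block, which keeps the transfer and the proxy file in a single step but still leaves the out-of-sync problem described above.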

Is there a more elegant way to do this, akin to Snakemake's Remote File capability? Would it be difficult to add Globus CLI support to Snakemake? Thanks in advance for any advice!


Solution

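  • Would it help to create a utility function that builds the list of all desired files and compares it against the list of files already available on Globus? Something like this (a sketch; the two lists still need to be filled in):

    def return_needed_files(wildcards=None):
        # Files that should end up on Globus: hard-coded or built with some logic.
        list_needed_files = []
        # Files already present remotely, e.g. parsed from the output of `globus ls`.
        list_available = set()
        # Keep only the files that are still missing remotely.
        return [f for f in list_needed_files if f not in list_available]

    # Request only the missing files in the all rule.
    # Snakemake calls an input function with the wildcards object, hence the parameter.
    rule all:
        input: return_needed_files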
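One possible way to fill in list_available is to shell out to globus ls and parse its plain-text output, which prints one entry per line. The sketch below assumes that output format; the helper names and the endpoint ID and path arguments are hypothetical placeholders, and the command requires a logged-in Globus CLI.

```python
import subprocess


def parse_ls_output(text):
    """Turn plain-text `globus ls` output (one entry per line) into a set of names."""
    return {line.strip() for line in text.splitlines() if line.strip()}


def list_remote(endpoint_id, path, run=subprocess.run):
    """List a remote directory by calling `globus ls ENDPOINT_ID:PATH`.

    `run` is injectable so the parsing can be tested without the CLI.
    """
    result = run(
        ["globus", "ls", f"{endpoint_id}:{path}"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_ls_output(result.stdout)


def missing_files(needed, available):
    """Return the needed names that are not yet available, preserving order."""
    available = set(available)
    return [name for name in needed if name not in available]
```

The comparison is done on bare file names, so the names returned to rule all would still have to be mapped to the local paths that the processing rules produce.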