python pandas workflow snakemake directed-acyclic-graphs

Can a Snakemake rule depend on data in the file instead of its change state?


I have data in a CSV file that changes frequently. The CSV file is an input to a Snakefile rule. I want this rule to run only when a certain value appears in the CSV data, not every time the file changes. Is it possible to make rule execution depend on specific patterns in the changed file rather than on the fact that it changed?


Solution

  • The check Snakemake uses to decide whether a rule should be re-executed is based on file timestamps (not file content), so the first thing to do is to wrap the relevant input files in ancient().

    Next, since the Snakefile is a Python file, the required logic can be written directly in it using pandas or another library for handling CSV files. Below is a rough idea:

    import pandas as pd

    csv_file = 'some_file.txt'
    df = pd.read_csv(csv_file)
    # only rows with column_x >= 10 contribute targets
    items_to_do = df.query('column_x >= 10')['column_y'].tolist()

    rule all:
        input: expand('file_out_{y}.txt', y=items_to_do)

    rule some_rule:
        # ancient() makes Snakemake ignore the timestamp of the CSV file
        input: ancient(csv_file)
        output: 'file_out_{y}.txt'
        ... # code to generate the file
    

    So if you update some_file.txt, but the updated values belong to rows where column_x is less than 10, then no new jobs will be executed.
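
    To see why such an update triggers nothing, the target list can be computed for two versions of the CSV and compared. This is only a sketch: it uses the standard library's csv module in place of pandas so that it runs anywhere, and the helper name items_to_do_from and the sample data are hypothetical. column_x and column_y are the placeholder columns from the snippet above.

```python
import csv
import io


def items_to_do_from(csv_text, threshold=10):
    """Return column_y for every row whose column_x is >= threshold."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["column_y"] for row in reader if float(row["column_x"]) >= threshold]


# Hypothetical sample data, not taken from the question.
before = "column_x,column_y\n12,a\n3,b\n15,c\n"
# Only the row below the threshold changed (3 -> 7): same targets.
after = "column_x,column_y\n12,a\n7,b\n15,c\n"
# A row crossed the threshold (3 -> 11): one new target appears.
crossed = "column_x,column_y\n12,a\n11,b\n15,c\n"

print(items_to_do_from(before))
print(items_to_do_from(after))
print(items_to_do_from(crossed))
```

    Since rule all requests exactly the files named by this list and the CSV is wrapped in ancient, only a target whose output file does not exist yet should lead to a new job.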

    Update: I assumed that the rule in question generates multiple files using wildcards, but on re-reading the question this doesn't seem to be the case. If it's just a single rule, the snippet above can be adapted along these lines:

    import pandas as pd

    csv_file = 'some_file.txt'

    def file_is_updated():
        df = pd.read_csv(csv_file)
        # implement logic to decide if the rule should be re-run,
        # e.g. return True if the file has more than 50 rows
        return len(df) > 50

    # use Python to define the rule conditionally
    if file_is_updated():
        rule some_rule:
            input: csv_file
            ...
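
    The question asks about a certain value appearing in the data rather than a row count, so the predicate could check for that value instead. The following is a minimal sketch using only the standard library; the function name value_appears, the column name status, and the value ready are assumptions made for illustration and do not come from the question.

```python
import csv
import os
import tempfile


def value_appears(csv_path, column, wanted):
    """Return True if any row of the CSV has `wanted` in `column`."""
    with open(csv_path, newline="") as handle:
        return any(row.get(column) == wanted for row in csv.DictReader(handle))


# Demonstration on a throwaway file with hypothetical contents.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("sample,status\ns1,pending\ns2,ready\n")
    demo_path = tmp.name

found_ready = value_appears(demo_path, "status", "ready")
found_failed = value_appears(demo_path, "status", "failed")
os.remove(demo_path)

print(found_ready, found_failed)
```

    In the Snakefile, a call such as value_appears(csv_file, 'status', 'ready') could then take the place of file_is_updated() in the if statement.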