Tags: ubuntu, bioinformatics, snakemake

How to enable "output" to detect variables in "run" in snakemake


I am trying to create a directory for each of the IDs stored in a .csv file (they are in the column "ID"). However, snakemake does not seem to see the "IDs" variable, which is defined in run, when it evaluates output.

rule make_directories:
    input:"exptXXXXX_metadata.csv"
    output:
        directory(expand("tissues/{id}", id = IDs))
    run:
        import pandas as pd
        df =  pd.read_csv('exptXXXXX_metadata.csv')

        ##unique IDs which I want to make different directories of under the parent_dir "tissues"
        IDs =  set(df['ID'])

        for i in IDs:
            f = output[id]
            shell("mkdir {f}")

I have tried different suggestions from the snakemake documentation: https://snakemake.readthedocs.io/en/stable/project_info/faq.html#how-do-i-access-elements-of-input-or-output-by-a-variable-index


Solution

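  • Snakemake resolves each job in the following order: output, then input, then run (or shell or script). The output and input parts are evaluated while the workflow is being set up, before the run part of any job executes.

    So the output part can't refer to variables that are only calculated later, inside the run part of the rule.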

    My guess is that you want to move the code that reads the CSV file and builds the ID list out of the rule completely, to the top level of the Snakefile. The IDs variable will then be defined by the time Snakemake evaluates the output: part of the rule.
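
As a rough illustration of that suggestion, the ID list can be built by plain Python that runs before any rule is evaluated. The sketch below is only one way to do it and makes some assumptions: the helper names read_ids and tissue_dirs are hypothetical, and it uses the standard-library csv module instead of pandas so that it has no dependencies (the result is the same set of unique IDs, returned sorted so the order is stable).

```python
import csv


def read_ids(csv_path, column="ID"):
    """Return the sorted unique values found in `column` of the CSV file."""
    with open(csv_path, newline="") as handle:
        # DictReader uses the header row to key each record by column name.
        return sorted({row[column] for row in csv.DictReader(handle)})


def tissue_dirs(ids, parent="tissues"):
    """Build one path per ID, mirroring expand("tissues/{id}", id=IDs)."""
    return [f"{parent}/{sample_id}" for sample_id in ids]
```

Placed above the rule in the Snakefile, a line such as IDs = read_ids("exptXXXXX_metadata.csv") would define IDs before output: is evaluated, so expand("tissues/{id}", id=IDs) no longer refers to an undefined name. The run: part can then loop over output directly, for example calling os.makedirs(d, exist_ok=True) for each d in output, rather than re-reading the CSV.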