Why does `snakemake --list-untracked ...` list the `config.yml` file in its output?


I'm learning Snakemake and ran my command with the --list-untracked option to check for unaccounted-for leftover outputs or unused inputs, like this:

snakemake --cores 12 --use-conda --directory test_data/small_fastq_samples/ --list-untracked

I found some abandoned reference inputs that I neglected to delete, but I also saw my config file in there too:

Building DAG of jobs...
config.yml
input/reference/gen.1.bt2
input/reference/gen.2.bt2
input/reference/gen.3.bt2
input/reference/gen.4.bt2
input/reference/gen.rev.1.bt2
input/reference/gen.rev.2.bt2

This surprised me, because the workflow reads from that config file:

configfile: "config.yml"
SAMPLES = config["sample_names"]
REFNAME = config["reference_name"]

rule all:
    input:
        expand("results/sorted_atac_alignments/QC/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=["zip", "html"]),
        expand("results/sorted_atac_alignments/{sample}.bam", sample=SAMPLES)

Have I made a mistake somewhere or is this intended behavior?

The documentation doesn't seem to explain why it would list the config file. It suggests that the option only lists files that are not used in the workflow:

--list-untracked, --lu

List all files in the working directory that are not used in the workflow. This can be used e.g. for identifying leftover files. Hidden files and directories are ignored.

Default: False

Assuming I don't have a mistake somewhere (because the workflow works as intended), what's the reasoning behind designating the config file as "not used"?

Additional Notes

If I run snakemake --lint from the main repo directory, I see:

WorkflowError in file /Users/rleach/PROJECT-local/ATACCOMPENDIUM/REPOS/ATACCompendium/Snakefile, line 3:
Workflow defines configfile config.yml but it is not present or accessible (full checked path: /Users/rleach/PROJECT-local/ATACCOMPENDIUM/REPOS/ATACCompendium/config.yml).
  File "/Users/rleach/PROJECT-local/ATACCOMPENDIUM/REPOS/ATACCompendium/Snakefile", line 3, in <module>

Though maybe the --lint utility isn't meant to be used on the main repo directory? If I run snakemake --lint --directory test_data/small_fastq_samples/, I see:

Congratulations, your workflow is in a good condition!

I'm still trying to get the hang of this. Is there a way to lint the codebase without having a dataset directory to run?

Given inferences I made from the linting output, I tried editing my rules file's all rule to include config.yml:

rule all:
    input:
        "config.yml",
        expand("results/sorted_atac_alignments/QC/fastqc/{sample}_fastqc.{ext}", sample=SAMPLES, ext=["zip", "html"]),
        expand("results/sorted_atac_alignments/{sample}.bam", sample=SAMPLES)

Now, if I run the command (as above) with --list-untracked, it does not output the config.yml:

$ snakemake --cores 12 --use-conda --directory test_data/small_fastq_samples/ --lint
Congratulations, your workflow is in a good condition!

So that seems to work. Is that bad practice? Or is there any reason/convention I shouldn't include the config file among the inputs?


Solution

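  • The current implementation of snakemake --list-untracked considers a file "tracked" only if it is part of your current DAG as an input, output or log file. All other files are considered "untracked" and are therefore reported. Your config file is read through the configfile directive when the workflow is parsed, but no rule declares it as an input, so it never enters the DAG.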

    Adding your configfile as a dummy dependency to a rule like rule all, only to remove it from the untracked listing, is something I would personally consider bad practice because it obfuscates dependencies. It could be misunderstood as a real dependency (e.g. that the rule itself requires the configfile and accesses it directly).

    Personally I would just ignore it.

    If you don't want config.yml to show up in the list of untracked files, I suggest you add a dummy rule with the sole purpose of tracking otherwise "untracked" files, e.g.:

    configfile: "config.yml"
    
    # Include all files you want considered "tracked" by snakemake
    rule track_files:
        input:
            "config.yml"
    
    # This is the default_target rule which is executed if snakemake
    # is called without a rule name or file name as target
    rule all:
        input:
            # Make tracked files part of your DAG
            rules.track_files.input
        default_target: True
        shell:
            "echo 'hello world'"
    

    This way your intention of including the dependency becomes clear. You can also add more files to the track_files rule. Note:

    1. rule track_files has to be defined before rule all, otherwise rules.track_files.input cannot be resolved.
    2. default_target: True in rule all makes sure that this rule is the one that runs when snakemake is called without an explicit target.
    3. If you call snakemake --list-untracked without a target, snakemake considers the DAG for the default rule, in your case all. If you provide a different rule or file as the target, all files that are not part of that specific DAG will be reported as "untracked".
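
To make the "tracked" versus "untracked" distinction concrete, here is a rough Python sketch of the behavior described above. It is not Snakemake's actual implementation: the function name `list_untracked` and the `tracked` argument are hypothetical, and the sketch simply assumes that the DAG's input, output and log files are already available as a set of paths relative to the working directory.

```python
import os
from pathlib import Path


def list_untracked(workdir, tracked):
    """Return files under workdir that are not in the tracked set.

    `tracked` (hypothetical) stands in for the input, output and log
    files of the jobs in the current DAG, as paths relative to workdir.
    Hidden files and directories are skipped, as the option's help says.
    """
    tracked = {Path(p).as_posix() for p in tracked}
    untracked = []
    for root, dirs, files in os.walk(workdir):
        # Prune hidden directories in place so os.walk does not enter them.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for name in files:
            if name.startswith("."):
                continue
            rel = Path(root, name).relative_to(workdir).as_posix()
            if rel not in tracked:
                untracked.append(rel)
    return sorted(untracked)
```

In this model, config.yml is reported as long as no rule in the target's DAG names it as an input, and it disappears from the list once a rule such as track_files does, which matches what the question observed.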