Generate many files with wildcard, then merge into one


I have two rules in my Snakefile: one generates several sets of files using wildcards, and the other merges everything into a single file. This is how I wrote them:

chr = range(1,23)

rule generate:
    input:
        og_files = config["tmp"] + '/chr{chr}.bgen',
    output:
        out = multiext(config["tmp"] + '/plink/chr{{chr}}',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
        --bgen {input.og_files} \
        --make-bed \
        --oxford-single-chr \
        --out {config[tmp]}/plink/chr{chr}
        """

rule merge:
    input:
        plink_chr = expand(config["tmp"] + '/plink/chr{chr}.{ext}',
                           chr = chr,
                           ext = ['bed', 'bim', 'fam'])
    output:
        out = multiext(config["tmp"] + '/all',
                       '.bed', '.bim', '.fam')
    shell:
        """
        plink \
        --pmerge-list-dir {config[tmp]}/plink \
        --make-bed \
        --out {config[tmp]}/all
        """

Unfortunately, Snakemake does not connect the files produced by the first rule to the input of the second rule:

$ snakemake -s myfile.smk -c1 -np
Building DAG of jobs...
MissingInputException in line 17 of myfile.smk:
Missing input files for rule merge:
[list of all the files made by expand()]

How can I generate the 22 sets of files with the chr wildcard in generate and still track them in the input of merge? Thank you in advance for your help.


Solution

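  • In rule generate, I think you don't want to escape the {chr} wildcard with double braces; otherwise it doesn't get replaced, and the outputs never match the chr1, chr2, ... files that merge requests. Double braces are for wildcards that must survive an expand() call. I.e.: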

            out = multiext(config["tmp"] + '/plink/chr{{chr}}',
                           '.bed', '.bim', '.fam')
    

    should be:

            out = multiext(config["tmp"] + '/plink/chr{chr}',
                           '.bed', '.bim', '.fam')
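
To see why the single-brace pattern lines up with what merge asks for, here is a minimal sketch in plain Python. It does not import Snakemake: `TMP` is a hypothetical stand-in for `config["tmp"]`, the list comprehension over `product` imitates the combinations that expand() builds, and `str.format` is used only as an approximation of wildcard substitution.

```python
from itertools import product

TMP = "tmp"               # hypothetical stand-in for config["tmp"]
CHROMS = range(1, 23)     # chromosomes 1..22, as in the question
EXTS = [".bed", ".bim", ".fam"]

# What merge requests: one path per chromosome/extension pair,
# the same Cartesian product that expand() builds.
requested = [f"{TMP}/plink/chr{c}{ext}" for c, ext in product(CHROMS, EXTS)]

# What generate can produce with a single-brace pattern: the prefix plus
# each extension (the concatenation multiext() performs), with the
# placeholder filled in once per chromosome.
pattern = TMP + "/plink/chr{chr}"
produced = [(pattern + ext).format(chr=c) for c in CHROMS for ext in EXTS]

# Doubled braces are str.format's escape for a literal brace, so the
# placeholder is left unfilled instead of becoming a chromosome number.
escaped = (TMP + "/plink/chr{{chr}}.bed").format(chr=1)

print(len(requested))          # 66
print(requested == produced)   # True
print(escaped)                 # tmp/plink/chr{chr}.bed
```

The 66 requested paths match the single-brace outputs exactly, while the doubled-brace version never yields a name such as chr1.bed. Separately, inside a shell block wildcard values are normally referenced as {wildcards.chr}; the bare {chr} in the question's generate rule may instead pick up the global chr variable, so that line is worth checking too.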