
Parallelise output of input function in Snakemake


Hello Snakemake community,

I am having trouble defining a function in Snakemake and calling it in the params section. The function returns a list, and my aim is to use each item of the list as a parameter of a shell command. In other words, I would like to run the same shell command as multiple parallel jobs, each with a different parameter.

This is the function:

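import os, glob
def get_scontigs_names(wildcards):
   scontigs = glob.glob(os.path.join("reference", "Supercontig*"))
   files = [os.path.basename(s) for s in scontigs]
   # keep only the part of each file name before the first underscore
   name = [i.split('_')[0] for i in files]
   return name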

The output is a list that looks like:

['Supercontig0', 'Supercontig100', 'Supercontig2', ...]
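
The helper can be checked on its own, outside Snakemake. The sketch below is only an illustration: `list_scontig_names` is a hypothetical variant that takes the directory as an argument, the file names such as `Supercontig0_a.fa` are made up, and the result is sorted so the order is reproducible (`glob` itself gives no ordering guarantee).

```python
import glob
import os
import tempfile


def list_scontig_names(ref_dir):
    """Return the supercontig names found in ref_dir (hypothetical helper)."""
    paths = glob.glob(os.path.join(ref_dir, "Supercontig*"))
    # Keep the part of the file name before the first underscore,
    # drop duplicates, and sort for a reproducible order.
    return sorted({os.path.basename(p).split("_")[0] for p in paths})


# Build a throwaway "reference" directory with made-up file names.
with tempfile.TemporaryDirectory() as ref_dir:
    for fname in ("Supercontig0_a.fa", "Supercontig100_a.fa", "Supercontig2_a.fa"):
        open(os.path.join(ref_dir, fname), "w").close()
    names = list_scontig_names(ref_dir)

print(names)
```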

The Snakemake rules are:

rule all:
    input:
        "updated/all_supercontigs.sorted.vcf.gz"
rule update_vcf:
    input:
        len="genome/genome_contigs_len_cumsum.txt",
        vcf="filtered/all.vcf.gz"
    output:
        cat="updated/all_supercontigs.updated.list"
    params:
        scaf=get_scontigs_names
    shell:
        """
        python 3.7 scripts/update_genomic_reg.py -len {input.len} -vcf {input.vcf} -scaf {params.scaf}
        ls updated/*.updated.vcf.gz > {output.cat}
        """

This code is incorrect because every item of the list is inserted into the shell command when I call {params.scaf}. The current shell command looks like:

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig0 Supercontig100 Supercontig2 ...

What I would like to get is:

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig0

python 3.7 scripts/update_genomic_reg.py -len genome/genome_contigs_len_cumsum.txt -vcf filtered/all.vcf.gz -scaf Supercontig100

and so on.
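
The difference between the two forms can be reproduced in plain Python. This is a rough sketch, not Snakemake's own code: it assumes only that a list-valued parameter is rendered as its items joined by spaces, which matches the command shown above. The names `render_joined` and `render_per_item` and the shortened template are made up for the illustration.

```python
# Shortened, hypothetical command template for the illustration.
TEMPLATE = "update_genomic_reg.py -scaf {scaf}"


def render_joined(template, names):
    """One command holding every name, as with a list-valued param."""
    return template.format(scaf=" ".join(names))


def render_per_item(template, names):
    """One command per name, which is the desired behaviour."""
    return [template.format(scaf=name) for name in names]


names = ["Supercontig0", "Supercontig100", "Supercontig2"]
print(render_joined(TEMPLATE, names))
for command in render_per_item(TEMPLATE, names):
    print(command)
```

Getting one job per name therefore means the name has to vary per job, which in Snakemake is what a wildcard in the output file does.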

I have tried to use wildcards inside the function, but I cannot work out which attribute to give it.

There are several posts about input functions and wildcards, plus the Snakemake docs, but I could not apply them to my case. Can somebody help me with this, please?


Solution

  • I found a solution to my question, inspired by @dariober.

    rule all:
        input:
            "updated/all_supercontigs.updated.list"
    
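    import os, glob

    def get_scontigs_names(wildcards=None):
        # collect the supercontig files and keep the name before the first underscore
        scontigs = glob.glob(os.path.join("reference", "Supercontig*"))
        files = [os.path.basename(s) for s in scontigs]
        name = [i.split('_')[0] for i in files]
        return name

    # names consumed by expand() in rule list_updated
    supercontigs = get_scontigs_names()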
    
    rule update_vcf:
        input:
            len="genome/genome_contigs_len_cumsum.txt",
            vcf="filtered/all.vcf.gz"
        output:
            vcf="updated/all_{supercontig}.updated.vcf.gz"
        params:
            py3=config["modules"]["py3"],
            scaf=get_scontigs_names
        shell:
            """
            {params.py3} scripts/update_genomic_reg.py -len {input.len} -vcf {input.vcf} -scaf {wildcards.supercontig}
            """
    
    
    rule list_updated:
        input:
            expand("updated/all_{supercontig}.updated.vcf.gz", supercontig=supercontigs)
        output:
            "updated/all_supercontigs.updated.list"
        shell:
            """
            ls {input} > {output}
            """
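
To see why this produces one job per supercontig, it can help to emulate the two steps in plain Python: building the list of requested files, and recovering the wildcard value from each requested file name. The sketch below is a simplified emulation under my own assumptions, not Snakemake's implementation; `expand_targets` and `match_wildcard` are hypothetical names.

```python
import re

PATTERN = "updated/all_{supercontig}.updated.vcf.gz"


def expand_targets(pattern, supercontigs):
    """Emulate expand(): one file name per supercontig."""
    return [pattern.format(supercontig=name) for name in supercontigs]


def match_wildcard(pattern, path):
    """Recover the wildcard value from a file name, or None if it does not match."""
    prefix, suffix = pattern.split("{supercontig}")
    regex = re.escape(prefix) + r"(?P<supercontig>.+)" + re.escape(suffix)
    m = re.fullmatch(regex, path)
    return m.group("supercontig") if m else None


targets = expand_targets(PATTERN, ["Supercontig0", "Supercontig100"])
for target in targets:
    # Each requested file corresponds to one update_vcf job
    # with its own value of wildcards.supercontig.
    print(target, "->", match_wildcard(PATTERN, target))
```

Because every requested file is a separate job, the jobs can run in parallel when Snakemake is given more than one core, for example with `--cores 4`.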