Search code examples
pythonlist-comprehensionwildcardsnakemakewildcard-expansion

improve snakemake rule in few aspects


I have a rule of which I use a loop in the shell parameter. Not sure if this is snakemake's best practice to utilize loops in the shell section. Perhaps someone could tell me whether or not this is the case. I also produce many files from such rules but I was not able to leverage the wildcard properties of this rule. Here is the code:

def create_chrom():
    chroms = ['chr'+str(i) for i in range(1,23)]
    chroms.extend(['chrX','chrY']) 
    return chroms

rule split_by_chr:
    input:
        vcf = 'output/all_families.vcf.gz',
        vcf_idx = 'output/all_families.vcf.gz.csi'
    output:
        out = expand('output/by_chrom/all_families_{chroms}.vcf.gz',chroms=create_chrom())
    shell:
        'module load bcftools && '
        'bcftools index -s {input.vcf} | cut -f 1 | grep -w "^chr[1-9]\|^chr[1-9][0-9]\|^chr[X-Y]" | while read C; '
        'do bcftools view -O z -o output/by_chrom/all_families_${{C}}.vcf.gz {input.vcf} -r ${{C}}; done'

Notice that the function create_chrom() is producing a list of chromosome names and this list is being used in the output section, but not in the shell section. I would like to know whether it is possible to use this list one element at a time on the shell section.

Edit:

Would be greate to interpolate each item of the list returned by create_chrom() in this rule:

def create_chrom():
    chroms = ['chr'+str(i) for i in range(1,23)]
    chroms.extend(['chrX','chrY']) 
    return chroms


rule all:
    input: expand('output/by_chrom/all_families_{chroms}.vcf.gz',chroms=create_chrom())

rule split_by_chr:
    input:
        vcf = 'output/all_families.vcf.gz',
        vcf_idx = 'output/all_families.vcf.gz.csi'
    param:
        chroms = create_chrom()
    output:
        out = 'output/by_chrom/all_families_{chroms}.vcf.gz'
    shell:
        'bcftools view -O z -o {output.out} {input.vcf} -r {param.chroms}'

I'm aware that like this I'm passing the whole list, but I need to pass one chromosome at a time. i.e run once with chr1 run again with chr2 ... etc


Solution

  • Assuming that shell command produces specific-wildcard output, then the rule definition should contain the representation without expanding the wildcards:

    rule split_by_chr:
        input:
            vcf = 'output/all_families.vcf.gz',
            vcf_idx = 'output/all_families.vcf.gz.csi'
        output:
            out = 'output/by_chrom/all_families_{chroms}.vcf.gz'
        shell:
            """
            echo "{output.out}"
            # whatever is the appropriate command
            """
            
    

    Then, to collect all the outputs, expand wildcards in the desired output using all variations into the default rule (typically rule all):

    rule all:
        input: expand('output/by_chrom/all_families_{chroms}.vcf.gz',chroms=create_chrom())
    

    edit: to avoid defining function create_chrom, it's possible to pass explicit list of wildcard values to expand in rule all definition:

    list_chroms = ['chr'+str(i) for i in range(1,23)] + ['chrX','chrY']
    
    rule all:
        input: expand('output/by_chrom/all_families_{chroms}.vcf.gz',chroms=list_chroms)