Tags: expand, many-to-one, snakemake

Snakemake - Many to one using an expand exception


I have a functioning Snakefile for Partitioned Heritability using multiple bed files. It produces a perfect list of jobs with snakemake -np, so this file only needs a minor tweak (I hope!).

My issue occurs in the merge_peaks rule below.

At this stage, I have 25 bed files and need to run 5 calls of merge_peaks, one call for each extension ext=[100,200,300,400,500], so each call should take as input only the 5 bed files with the relevant extension.

For example, for the merge_peaks output file peak_files/Fullard2018_peaks.mrgd.blrm.100.bed, I only need the following 5 bed files with ext=100 to be used as input:

peak_files/fullard2018_NpfcATAC_1.blrm.100.bed
peak_files/fullard2018_NpfcATAC_2.blrm.100.bed
peak_files/fullard2018_NpfcATAC_3.blrm.100.bed
peak_files/fullard2018_NpfcATAC_4.blrm.100.bed
peak_files/fullard2018_NpfcATAC_5.blrm.100.bed

Here is my config file:

samples:
    fullard2018_NpfcATAC_1:
    fullard2018_NpfcATAC_2:
    fullard2018_NpfcATAC_3:
    fullard2018_NpfcATAC_4:
    fullard2018_NpfcATAC_5:
index: /home/genomes_and_index_files/hg19.chrom.sizes
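
(In case it matters: this config loads as a dict where the sample names are the keys and the values are empty, so iterating over config['samples'] yields the names. A quick check in plain Python:)

import yaml

# Inspect what the config looks like once parsed
with open("config.yaml") as fh:
    config = yaml.safe_load(fh)

print(list(config['samples']))
# ['fullard2018_NpfcATAC_1', 'fullard2018_NpfcATAC_2', 'fullard2018_NpfcATAC_3',
#  'fullard2018_NpfcATAC_4', 'fullard2018_NpfcATAC_5']
print(config['index'])
# /home/genomes_and_index_files/hg19.chrom.sizes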

Here is the Snakefile:

# read config info into this namespace
configfile: "config.yaml"
print (config['samples'])

rule all:
    input:
        expand("peak_files/{sample}.blrm.{ext}.bed", sample=config['samples'], ext=[100,200,300,400,500]),
        expand("LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz", sample=config['samples'], ext=[100,200,300,400,500], chr=range(1,23))

rule annot2bed:
    input:
        folder = "Reference/baseline"
    params:
        file = "Reference/baseline/baseline.{chr}.annot.gz"
    output:
        "LD_annotation_files/baseline.{chr}_no_head.bed"
    shell:
        "zcat {params.file} | tail -n +2 |awk -v OFS=\"\t\" '{{print \"chr\"$1, $2-1, $2, $3, $4}}' "
        "| sort -k1,1 -k2,2n > {output}"

rule extend_bed:
    input:
        "peak_files/{sample}_peaks.blrm.narrowPeak"
    output:
        "peak_files/{sample}.blrm.{ext}.bed"
    params:
        ext = "{ext}",
        index = config["index"]
    shell:
        "bedtools slop -i {input} -g {params.index} -b {params.ext} > {output}"

rule merge_peaks:
    input:
        expand("peak_files/{sample}.blrm.{ext}.bed", sample = config['samples'], ext=[100,200,300,400,500])
    output:
        "peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed"
    shell:
        "cat {input} | bedtools sort -i stdin | bedtools merge -i stdin > {output}"

rule intersect_mybed:
    input:
        annot = rules.annot2bed.output,
        mybed = rules.merge_peaks.output
    output:
        "LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz"
    params:
        uncompressed = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot"
    shell:
        "echo -e \"CHR\tBP\tSNP\tCM\tANN\" > {params.uncompressed}; "
        "/share/apps/bedtools intersect -a {input.annot} -b {input.mybed} -c | "
        "sed 's/^chr//g' | awk -v OFS=\"\t\" '{{print $1, $2, $4, $5, $6}}' >> {params.uncompressed}; "
        "gzip {params.uncompressed}"

rule ldsr:
    input:
        annot = "LD_annotation_files/Fullard2018.{ext}.{chr}.annot.gz",
        bfile_folder = "Reference/1000G_plinkfiles",
        snps_folder = "Reference/hapmap3_snps"
    output:
        "LD_annotation_files/Fullard2018.{ext}.{chr}.l2.ldscore.gz"
    conda:
        "envs/p2-ldscore.yaml"
    params:
        bfile = "Reference/1000G_plinkfiles/1000G.mac5eur.{chr}",
        ldscores = "LD_annotation_files/Fullard2018.{ext}.{chr}",
        snps = "Reference/hapmap3_snps/hm.{chr}.snp"
    log:
        "logs/LDSC/Fullard2018.{ext}.{chr}_ldsc.txt"
    shell:
        "export MKL_NUM_THREADS=2;" # Export arguments are  workaround as ldsr uses all available cores
        "export NUMEXPR_NUM_THREADS=2;" # Numbers must match cores parameter in cluster config
        "Reference/ldsc/ldsc.py --l2 --bfile {params.bfile} --ld-wind-cm 1 "
        "--annot {input.annot} --out {params.ldscores} --print-snps {params.snps} 2> {log}"

What is currently happening is that all 25 bed files are being fed to the merge_peaks rule on every call, whereas I only need the 5 with the relevant extension each time. I'm struggling to work out how to use the expand function correctly, or an alternative method, to include and merge only the relevant bed files in each call of the rule.
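
To illustrate, expand takes the cross product of all the values it is given, so the merge_peaks input resolves to the same 25 files no matter which {ext} the output asks for, roughly equivalent to:

from itertools import product

# What the current expand() in merge_peaks evaluates to, for every call of the rule
samples = ['fullard2018_NpfcATAC_%d' % i for i in range(1, 6)]
exts = [100, 200, 300, 400, 500]

inputs = ['peak_files/%s.blrm.%d.bed' % (s, e) for s, e in product(samples, exts)]
print(len(inputs))  # 25 - every sample crossed with every extension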

This question asks something similar, I think, but is not quite what I'm looking for as it doesn't use a config file - Snakemake: rule for using many inputs for one output with multiple sub-groups

I love Snakemake, but my Python is a bit dicey.

Many Thanks.


Solution

  • If I understand correctly, you managed to create one file per sample per extension (25 files in total), and now you want to merge the files that share the same extension. So your required final output should be:

    peak_files/Fullard2018_peaks.mrgd.blrm.100.bed, 
    peak_files/Fullard2018_peaks.mrgd.blrm.200.bed, 
    peak_files/Fullard2018_peaks.mrgd.blrm.300.bed, 
    peak_files/Fullard2018_peaks.mrgd.blrm.400.bed, 
    peak_files/Fullard2018_peaks.mrgd.blrm.500.bed
    

    (For testing, I create the 25 input files to be merged by extension:)

    mkdir -p peak_files
    for i in 100 200 300 400 500
    do
        touch peak_files/fullard2018_NpfcATAC_1.blrm.${i}.bed
        touch peak_files/fullard2018_NpfcATAC_2.blrm.${i}.bed
        touch peak_files/fullard2018_NpfcATAC_3.blrm.${i}.bed
        touch peak_files/fullard2018_NpfcATAC_4.blrm.${i}.bed
        touch peak_files/fullard2018_NpfcATAC_5.blrm.${i}.bed
    done
    

    This Snakefile should do the job. Of course, you can move samples and exts to config entries:

    samples= ['fullard2018_NpfcATAC_1', 
              'fullard2018_NpfcATAC_2',
              'fullard2018_NpfcATAC_3', 
              'fullard2018_NpfcATAC_4', 
              'fullard2018_NpfcATAC_5']
    
    exts= [100, 200, 300, 400, 500]
    
    rule all:
        input:
            expand('peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed', ext= exts),
    
    rule merge_peaks:
        input:
            lambda wildcards: expand('peak_files/{sample}.blrm.{ext}.bed', 
                sample= samples, ext= wildcards.ext),
        output:
            'peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed',
        shell:
            r"""
            cat {input} > {output}
            """
    

    The lambda function in merge_peaks says: for each extension ext, give me a list of files, one file for each sample in samples.
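
    And since you already have the sample names in your config.yaml, here is a minimal sketch of the same idea wired to your config. The lambda is replaced by a named input function, which is equivalent (the function name merge_peaks_input is arbitrary), and the shell command is your original one:

    configfile: "config.yaml"

    exts = [100, 200, 300, 400, 500]

    def merge_peaks_input(wildcards):
        # One bed file per sample, all sharing the extension of the requested output
        return expand("peak_files/{sample}.blrm.{ext}.bed",
                      sample=config['samples'], ext=wildcards.ext)

    rule all:
        input:
            expand("peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed", ext=exts)

    rule merge_peaks:
        input:
            merge_peaks_input
        output:
            "peak_files/Fullard2018_peaks.mrgd.blrm.{ext}.bed"
        shell:
            "cat {input} | bedtools sort -i stdin | bedtools merge -i stdin > {output}"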