
Snakemake trouble accessing nested values in config.yaml


My original issue (below) is partially solved. However, I'm now trying to request one output per method in rule all and have another rule resolve that method name into its dependent input files. My code:

rule all:
    input:
        [f"outputs/STAR/all/{x}/counts_2.txt" for x in config["method"]]

rule feature_counts_per_sample:
    input:
        bam=[f"outputs/STAR/{name}/Aligned.sortedByCoord.out.sortedbyname.bam" for name in config["method"][{x}]],
        gtf="data/chr19_20Mb.gtf"
    output:
        outA="outputs/STAR/all/{x}/counts_1.txt",
        outB="outputs/STAR/all/{x}/counts_2.txt"
    shell:
        "mkdir -p outputs/STAR/all/{wildcards.x}/ && featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.outA} -s 1 {input.bam} && featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.outB} -s 2 {input.bam}"

The problem is with input.bam: I get a "name 'x' is not defined" error and cannot find a way to resolve it. Apart from that, I know the rule works, because if I replace {x} with a constant value I get the expected results. Is there a way to do this, or should I be looking for a completely different approach?


I'm having trouble accessing nested values from my config.yaml file. My config.yaml:

method:
    collibri:
        - Collibri_standard_protocol-HBR-Collibri-100_ng-2_S1_L001_R
        - Collibri_standard_protocol-HBR-Collibri-100_ng-3_S2_L001_R
        - Collibri_standard_protocol-UHRR-Collibri-100_ng-2_S3_L001_R
        - Collibri_standard_protocol-UHRR-Collibri-100_ng-3_S4_L001_R
    kapa:
        - KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-3_S8_L001_R
        - KAPA_mRNA_HyperPrep_-HBR-KAPA-100_ng_total_RNA-2_S5_L001_R
        - KAPA_mRNA_HyperPrep_-HBR-KAPA-100_ng_total_RNA-3_S6_L001_R
        - KAPA_mRNA_HyperPrep_-UHRR-KAPA-100_ng_total_RNA-2_S7_L001_R

num:
    - 1
    - 2

type:
    - collibri
    - kapa

My goal is to pass all files of a method group as inputs at once and write the output to a folder named after the method (e.g. run a rule on all names under 'kapa' at once and place the output in the 'kapa' folder). A shortened version of my Snakefile:

configfile: "config.yaml"

rule all:
    input:
        expand("outputs/STAR/{filename}/Aligned.sortedByCoord.out.bam.bai", filename=config["method"]["collibri"]),
        expand("outputs/STAR/{filename}/counts_2.txt", filename=config["method"]["collibri"]),
        expand("outputs/STAR/{filename}/Aligned.sortedByCoord.out.bam.bai", filename=config["method"]["kapa"]),
        expand("outputs/STAR/{filename}/counts_2.txt", filename=config["method"]["kapa"]),
        expand("outputs/STAR/{type}/counts_2.txt", type=config["type"])

rule bam_index:
    input:
        "outputs/STAR/{filename}/Aligned.sortedByCoord.out.bam"
    output:
        "outputs/STAR/{filename}/Aligned.sortedByCoord.out.bam.bai"
    shell:
        "samtools index {input}"
        
rule bam_sort_name:
    input:
        "outputs/STAR/{filename}/Aligned.sortedByCoord.out.bam"
    output:
        "outputs/STAR/{filename}/Aligned.sortedByCoord.out.sortedbyname.bam"
    shell:
        "samtools sort -n -o {output} {input}"

rule feature_counts:
    input:
        bam="outputs/STAR/{filename}/Aligned.sortedByCoord.out.sortedbyname.bam",
        gtf="data/chr19_20Mb.gtf"
    output:
        out1="outputs/STAR/{filename}/counts_1.txt",
        out2="outputs/STAR/{filename}/counts_2.txt"
    shell:
        "featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.out1} -s 1 {input.bam} && featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.out2} -s 2 {input.bam}"

rule feature_counts_per_sample:
    input:
        bam=expand("outputs/STAR/{name}/Aligned.sortedByCoord.out.sortedbyname.bam", name=config["method"][{type}]),
        gtf="data/chr19_20Mb.gtf"
    output:
        out1="outputs/STAR/{type}/counts_1.txt",
        out2="outputs/STAR/{type}/counts_2.txt"
    shell:
        "mkdir -p outputs/STAR/{type}/ && featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.out1} -s 1 {input.bam} && featureCounts -p -t exon -g gene_id -a {input.gtf} -o {output.out2} -s 2 {input.bam}"

Overall, there are two issues that I cannot solve:

  • Is there a way to reference all list items under 'method' at once, so I don't have to define the same output in rule all twice with different config keys (filename=config["method"]["collibri"] and filename=config["method"]["kapa"], for the rules bam_index and feature_counts)?
  • The rule 'feature_counts_per_sample' does not work (of course), but it was my latest attempt at using the variables 'collibri' and 'kapa' in one place and expanding them, in another place, to the list of filenames that must be passed as inputs together. Any advice here?

Solution

  • This line is wrong:

        bam=[f"outputs/STAR/{name}/Aligned.sortedByCoord.out.sortedbyname.bam" for name in config["method"][{x}]],
    

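    Snakemake only knows the specific value of x once it resolves the wildcards for a concrete job, but the list comprehension runs as plain Python while the Snakefile is being parsed. At that point {x} is a Python set literal containing the undefined name x, which is why you get the "name 'x' is not defined" error. To postpone the evaluation, pass an input function (here a lambda) that receives the wildcards as its argument: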

       bam=lambda wildcards: [f"outputs/STAR/{name}/Aligned.sortedByCoord.out.sortedbyname.bam" for name in config["method"][wildcards.x]],
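
To see the deferred evaluation in isolation, here is a minimal plain-Python sketch that runs without Snakemake. The config dict stands in for the parsed config.yaml, with shortened, made-up sample names. SimpleNamespace only imitates the attribute access of Snakemake's wildcards object. The helper names bams_for_method and all_samples are my own, not part of Snakemake. The second helper is one possible way to flatten every list under 'method', which may help with the first question.

```python
from types import SimpleNamespace

# Stand-in for the dict Snakemake builds from config.yaml.
# Sample names are shortened placeholders, not the real ones.
config = {
    "method": {
        "collibri": ["collibri_S1", "collibri_S2"],
        "kapa": ["kapa_S5", "kapa_S6"],
    }
}

BAM = "outputs/STAR/{name}/Aligned.sortedByCoord.out.sortedbyname.bam"


def bams_for_method(wildcards):
    """Input function: runs only when called, i.e. once wildcards.x is known."""
    return [BAM.format(name=name) for name in config["method"][wildcards.x]]


def all_samples():
    """Flatten every list under config['method'] into one list of sample names."""
    return [name for names in config["method"].values() for name in names]


# SimpleNamespace mimics `wildcards.x` for a job where x == "kapa".
kapa_inputs = bams_for_method(SimpleNamespace(x="kapa"))
print(kapa_inputs)
print(all_samples())
```

In a Snakefile, the named function could replace the lambda (bam=bams_for_method), and all_samples() could be passed as filename=all_samples() to a single expand() call in rule all instead of repeating it per method. Note also that inside a shell command the wildcard must be written as {wildcards.x}, as in your updated rule.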