Search code examples
pythonsnakemakefastq

Combine multiple wildcards in Snakemake


├── DIR1
│   ├── smp1.fastq.gz
│   ├── smp1_fastqc/
│   ├── smp2.fastq.gz
│   └── smp2_fastqc/
└── DIR2
    ├── smp3.fastq.gz
    ├── smp3_fastqc/
    ├── smp4.fastq.gz
    └── smp4_fastqc/

I would like to count the number of reads by sample and then concatenate all counts by directory.
I create a dictionnary to link sample 1 and 2 to directory 1, and sample 3 et 4 to directory 2

DIRS,SAMPLES = glob_wildcards(INDIR+'/{dir}/{smp}.fastq.gz')

# Create samples missing
def filter_combinator(combinator, authlist):
    def filtered_combinator(*args, **kwargs):
        for wc_comb in combinator(*args, **kwargs):
            if frozenset(wc_comb) in authlist:
                yield wc_comb
    return filtered_combinator

# Authentification
combine_dir_samples = []

for dir in DIRS:
    samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
    for smp in samples:
        combine_dir_samples.append( { "dir" : dir, "smp" : smp} )
       
combine_dir_samples = { frozenset( x.items() ) for x in combine_dir_samples }
dir_samples = filter_combinator(product, combine_dir_samples)

Then, I create a rule to count my reads by sample

rule all:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

rule countReads:
    input:
        INDIR+'/{dir}/{smp}_fastqc/fastqc_data.txt'
    output:
        INDIR+'/{dir}/{smp}_Nreads.txt'
    shell:
        "grep 'Total\ Sequences' {input} | awk '{{print {wildcards.dir},$3}}' > {output}"

---------------------------------------------------------------
# result ok

├── DIR1
│   ├── smp1_Nreads.txt
│   └── smp2_Nreads.txt
└── DIR2
    ├── smp3_Nreads.txt
    └── smp4_Nreads.txt

> cat smp1_Nreads.txt
DIR1 15082186

But then, I would like to add a rule to concatenate my smp_Nreads.txt files by directory

rule concatNreads:
    input:
        expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
    output:
        INDIR+'/{dir}/Nreads_{dir}.txt'
    shell:
        "cat {input} > {output}"
------------------------------------------------------------------
# result

├── DIR1
│   └── Nreads_DIR1.txt
└── DIR2
    └── Nreads_DIR2.txt

# but both files are identical
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
DIR2 11635831
DIR2 45924459

# I would like to have
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
> cat Nreads_DIR2.txt
DIR2 11635831
DIR2 45924459

I tried with different input syntax for my concat rule

expand(OUTFastq+'/{dir}/FastQC/{{smp}}_Nreads.txt', dir_samples, dir=DIRS)
lambda wildcards: expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)

I don't find any solution, it is like it doesn't care of my dictionnary for this rule.


EDIT

I tried to use a dictionnary instead of my combination filter_combinator and to use a function as input of my rule to get samples.

dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}

def func(dir):
    return dir_to_samples[dir]

rule all:
    input:
        lambda wildcards: expand(OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip', dir=wildcards.dir, smp=func(wildcards.dir))

rule fastQC:
    input:
        lambda wildcards: expand(INDIR+'/{dir}/{smp}.fastq.gz', dir=wildcards.dir, smp=func(wildcards.dir))
    output:
        OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip'
    shell:
        "fastqc {input} -o {OUTDIR}/{wildcards.dir}/FastQC/" 

> AttributeError: 'Wildcards' object has no attribute 'dir'

Solution

  • First of all, I think that you have overcomplicated the solution making it less idiomatic for Snakemake. As the result you are experiencing the problems implementing the rule. Anyway, let me answer just the question in the form you've asked it.

    There should be no surprise that both of your Nreads_DIRx.txt files are identical, as the input doesn't depend on any wildcard in the output:

    rule concatNreads:
        input:
            expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
    

    Here the expand function resolves both dir and smp variables producing the list of fully specified filenames. What you need is something that really depends on the wildcards in your output:

    rule concatNreads:
        input:
            lambda wildcards: ...
    

    The {dir} is fully specified with the wildcard from the output, so you don't need to assign the values to it from the DIRS variable:

    rule concatNreads:
        input:
            lambda wildcards: expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir=wildcards.dir, smp=func(wildcards.dir))
    

    Now the question is how to implement this func function that produces the list of samples for a directory. It took me a while to understand your tricks with combine_dir_samples and filter_combinator, so I'm leaving that to you to implement the func function using that code. But what you really need is a map from DIR -> SAMPLES:

    dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}
    
    def func(dir):
        return dir_to_samples[dir]
    

    This dir_to_samples could probably be evaluated even easier, but here is your modified solution:

    for dir in DIRS:
        samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
        dir_to_samples.append({dir: samples})