├── DIR1
│ ├── smp1.fastq.gz
│ ├── smp1_fastqc/
│ ├── smp2.fastq.gz
│ └── smp2_fastqc/
└── DIR2
├── smp3.fastq.gz
├── smp3_fastqc/
├── smp4.fastq.gz
└── smp4_fastqc/
I would like to count the number of reads by sample and then concatenate all counts by directory.
I create a dictionnary to link sample 1 and 2 to directory 1, and sample 3 et 4 to directory 2
DIRS,SAMPLES = glob_wildcards(INDIR+'/{dir}/{smp}.fastq.gz')
# Create samples missing
def filter_combinator(combinator, authlist):
def filtered_combinator(*args, **kwargs):
for wc_comb in combinator(*args, **kwargs):
if frozenset(wc_comb) in authlist:
yield wc_comb
return filtered_combinator
# Authentification
combine_dir_samples = []
for dir in DIRS:
samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
for smp in samples:
combine_dir_samples.append( { "dir" : dir, "smp" : smp} )
combine_dir_samples = { frozenset( x.items() ) for x in combine_dir_samples }
dir_samples = filter_combinator(product, combine_dir_samples)
Then, I create a rule to count my reads by sample
rule all:
input:
expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
rule countReads:
input:
INDIR+'/{dir}/{smp}_fastqc/fastqc_data.txt'
output:
INDIR+'/{dir}/{smp}_Nreads.txt'
shell:
"grep 'Total\ Sequences' {input} | awk '{{print {wildcards.dir},$3}}' > {output}"
---------------------------------------------------------------
# result ok
├── DIR1
│ ├── smp1_Nreads.txt
│ └── smp2_Nreads.txt
└── DIR2
├── smp3_Nreads.txt
└── smp4_Nreads.txt
> cat smp1_Nreads.txt
DIR1 15082186
But then, I would like to add a rule to concatenate my smp_Nreads.txt
files by directory
rule concatNreads:
input:
expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
output:
INDIR+'/{dir}/Nreads_{dir}.txt'
shell:
"cat {input} > {output}"
------------------------------------------------------------------
# result
├── DIR1
│ └── Nreads_DIR1.txt
└── DIR2
└── Nreads_DIR2.txt
# but both files are identical
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
DIR2 11635831
DIR2 45924459
# I would like to have
> cat Nreads_DIR1.txt
DIR1 15082186
DIR1 22326081
> cat Nreads_DIR2.txt
DIR2 11635831
DIR2 45924459
I tried with different input syntax for my concat rule
expand(OUTFastq+'/{dir}/FastQC/{{smp}}_Nreads.txt', dir_samples, dir=DIRS)
lambda wildcards: expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
expand(OUTFastq+'/{dir}/FastQC/{wildcards.smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
I don't find any solution, it is like it doesn't care of my dictionnary for this rule.
EDIT
I tried to use a dictionnary instead of my combination filter_combinator
and to use a function as input of my rule to get samples.
dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}
def func(dir):
return dir_to_samples[dir]
rule all:
input:
lambda wildcards: expand(OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip', dir=wildcards.dir, smp=func(wildcards.dir))
rule fastQC:
input:
lambda wildcards: expand(INDIR+'/{dir}/{smp}.fastq.gz', dir=wildcards.dir, smp=func(wildcards.dir))
output:
OUTDIR+'/{dir}/FastQC/{smp}_fastqc.zip'
shell:
"fastqc {input} -o {OUTDIR}/{wildcards.dir}/FastQC/"
> AttributeError: 'Wildcards' object has no attribute 'dir'
First of all, I think that you have overcomplicated the solution making it less idiomatic for Snakemake. As the result you are experiencing the problems implementing the rule. Anyway, let me answer just the question in the form you've asked it.
There should be no surprise that both of your Nreads_DIRx.txt
files are identical, as the input doesn't depend on any wildcard in the output:
rule concatNreads:
input:
expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir_samples, dir=DIRS, smp=SAMPLES)
Here the expand
function resolves both dir
and smp
variables producing the list of fully specified filenames. What you need is something that really depends on the wildcards in your output:
rule concatNreads:
input:
lambda wildcards: ...
The {dir}
is fully specified with the wildcard from the output, so you don't need to assign the values to it from the DIRS
variable:
rule concatNreads:
input:
lambda wildcards: expand(INDIR+'/{dir}/{smp}_Nreads.txt', dir=wildcards.dir, smp=func(wildcards.dir))
Now the question is how to implement this func
function that produces the list of samples for a directory. It took me a while to understand your tricks with combine_dir_samples
and filter_combinator
, so I'm leaving that to you to implement the func
function using that code. But what you really need is a map from DIR -> SAMPLES:
dir_to_samples = {"DIR1": ["smp1", "smp2"], "DIR2": ["smp3", "smp4"]}
def func(dir):
return dir_to_samples[dir]
This dir_to_samples
could probably be evaluated even easier, but here is your modified solution:
for dir in DIRS:
samples, = glob_wildcards(INDIR+'/'+dir+'/{smp}.fastq.gz')
dir_to_samples.append({dir: samples})