Let's say I have sample SAMPLE_A, divided into two files SAMPLE_A_1 and SAMPLE_A_2 and associated with barcodes AATT and TTAA, and sample SAMPLE_B, divided into four files SAMPLE_B_1...SAMPLE_B_4 and associated with barcodes CCGG, GGCC and GCGC.

I can create getSampleNames() to get [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [1, 2, 1, 2, 3, 4], and then zip them to get the combination {sample}_{id}. I can then do the same thing for the barcodes: [SAMPLE_A, SAMPLE_A, SAMPLE_B, SAMPLE_B, SAMPLE_B] and [AATT, TTAA, CCGG, GGCC, GCGC].
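For reference, a minimal sketch of what these two helpers could look like; the hardcoded dictionaries are assumptions, and in a real pipeline they would come from a sample sheet or the filesystem:

def getSampleNames():
    # hypothetical implementation: values are hardcoded here, but they
    # could just as well be read from a sample sheet or from disk
    ids = {'SAMPLE_A': [1, 2], 'SAMPLE_B': [1, 2, 3, 4]}
    samples = [s for s, nums in ids.items() for _ in nums]
    numbers = [n for nums in ids.values() for n in nums]
    return samples, numbers  # parallel lists, ready for a zip-expand

def getBCs(samples):
    # hypothetical implementation mapping each sample to its barcodes;
    # sorted() because the caller may pass an unordered set
    bcs = {'SAMPLE_A': ['AATT', 'TTAA'], 'SAMPLE_B': ['CCGG', 'GGCC', 'GCGC']}
    flat = [(s, b) for s in sorted(samples) for b in bcs[s]]
    return [s for s, _ in flat], [b for _, b in flat]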
SAMPLES_ID, IDs = getSampleNames()
SAMPLES_BC, BCs = getBCs(set(SAMPLES_ID))

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    shell:
        "isoseq3 refine {input}"

rule split:
    input:
        expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
               zip, sample = SAMPLES_ID, id = IDs),
    output:
        expand("{sample}/cells/{barcode}_{sample}/fltnc.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),
    params:
        script_dir = config["script_dir"]  # assumed to be defined, e.g. in the config
    shell:
        "python {params.script_dir}/split_cells_bam.py"

rule dedup_split:
    input:
        "{sample}/cells/{barcode}_{sample}/fltnc.bam"
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    shell:
        "isoseq3 dedup {input} {output.bam}"

rule merge:
    input:
        expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
               zip, sample = SAMPLES_BC, barcode = BCs),
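To spell out the zip behaviour: expand() with zip pairs the two lists element-wise instead of taking their Cartesian product, so the split rule's input resolves to exactly one path per (sample, id) pair:

expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
       zip, sample = SAMPLES_ID, id = IDs)
# ['SAMPLE_A/polyA_trimming/SAMPLE_A_1.fltnc.bam',
#  'SAMPLE_A/polyA_trimming/SAMPLE_A_2.fltnc.bam',
#  'SAMPLE_B/polyA_trimming/SAMPLE_B_1.fltnc.bam',
#  'SAMPLE_B/polyA_trimming/SAMPLE_B_2.fltnc.bam',
#  'SAMPLE_B/polyA_trimming/SAMPLE_B_3.fltnc.bam',
#  'SAMPLE_B/polyA_trimming/SAMPLE_B_4.fltnc.bam']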
How can I prevent the split rule from being a bottleneck in my pipeline? For now it waits for the refine rule to finish for all samples, which is not necessary: every sample should run independently. But I can't do that, because the set of barcodes is different for each sample. Is there a way to have something like

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC])

where each {sample} of SAMPLES_BC is a key in the BCs dictionary? And the same for IDs? I know I can use functions, but then I'm not sure how to propagate the {barcode} through the rules.
I found how to use dictionaries through input functions, which solved my problem!

The major drawback of this solution is that you have to create a dummy file as the output of the split rule, instead of checking that each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant...
IDs = getSampleNames()  # {SAMPLE_A: [1, 2], SAMPLE_B: [1, 2, 3, 4]}
SAMPLES = list(IDs.keys())
BCs = getBCs(SAMPLES)  # {SAMPLE_A: [AATT, TTAA], SAMPLE_B: [CCGG, GGCC, GCGC]}

# function linking IDs and SAMPLE
def sample2ids(wildcards):
    return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam',
                  id = IDs[wildcards.sample])

# function linking BCs and SAMPLE
def sample2bcs(wildcards):
    return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
                  barcode = BCs[wildcards.sample])

rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',

rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe;
        # it has to carry the {sample} wildcard, otherwise the input
        # function cannot resolve wildcards.sample
        '{sample}/dummy_file.txt'

rule dedup_split:
    input:
        '{sample}/dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",

rule merge:
    input:
        sample2bcs
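A small refinement, assuming only Snakemake's built-in touch() output flag: marking the dummy output with touch() lets Snakemake create and timestamp the flag file itself once the rule succeeds, so the split script never has to know about it. A sketch (the shell line is a hypothetical placeholder):

rule split:
    input:
        sample2ids
    output:
        # touch() makes Snakemake create the flag file itself after
        # the rule's commands have succeeded
        touch('{sample}/dummy_file.txt')
    shell:
        "python split_cells_bam.py {input}"  # hypothetical invocation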