Search code examples
snakemake

Interconnected variables in snakemake


Let's say I have sample SAMPLE_A, divided into two files SAMPLE_A_1, SAMPLE_A_2 and associated to barcodes AATT, TTAA, and SAMPLE_Bassociated to barcodes CCGG, GGCC, GCGC, divided in 4 files SAMPLE_B_1...SAMPLE_B_4.

I can create getSampleNames() to get [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [1,2,1,2,3,4] and then zip them to get the combination {sample}_{id}. And then I can do the same thing for the barcodes: [SAMPLE_A,SAMPLE_A,SAMPLE_B,SAMPLE_B,SAMPLE_B] and [AATT, TTAA,CCGG, GGCC, GCGC].

SAMPLES_ID,IDs = getSampleNames()
SAMPLES_BC,BCs = getBCs(set(SAMPLES_ID))

rule refine:
input:
    '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
output:
    bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
shell:
    "isoseq3 refine {input} "


rule split:
input:
    expand('{sample}/polyA_trimming/{sample}_{id}.fltnc.bam', zip, sample = SAMPLES_ID, id = IDs),
output:
    expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs),
shell:
    "python {params.script_dir}/split_cells_bam.py"


rule dedup_split:
input:
    "{sample}/cells/{barcode}_{sample}/fltnc.bam"
output:
    bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
shell:
    "isoseq3 dedup {input} {output.bam} "

rule merge:
input:
    expand("{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
        zip, sample = SAMPLES_BC, barcode = BCs),

How can I prevent the rule split to be a bottleneck in my pipeline? For now it waits for the refine rule to be done for all samples while it's not necessary, every sample should run independently, but I can't because the set of barcodes is different for each sample. Is there a way to have something like

expand("{sample}/cells/{barcode}_{sample}/fltnc.bam", zip, sample = SAMPLES_BC, barcode = BCs[SAMPLES_BC]), where each {sample} of SAMPLES_BC is a key in BCs dictionary ? And same for IDs? I know I can use functions, but then I'm not sure how to propagate the {barcode} through the rules


Solution

  • I found how to use dictionaries through functions, which solved my problem!

    The major default of this solution is that you have to create a dummy file as output of the split rule, instead of checking if each '{sample}/cells/{barcode}_{sample}/fltnc.bam' file is created, so I am still looking for something more elegant...

    IDs = getSampleNames() #{SAMPLE_A:[1,2], SAMPLE_B:[1,2,3,4]}
    SAMPLES = list(IDs.keys()) 
    BCs = getBCs(SAMPLES) #{SAMPLE_A:[AATT, TTAA], SAMPLE_B:[CCGG,GGCC,GCGC]}
        
    # function linking IDs and SAMPLE
    def sample2ids(wildcards):
        return expand('{{sample}}/polyA_trimming/{{sample}}_{id}.fltnc.bam', 
                   id = IDs[wildcards.sample])
    
    # function linking BCs and SAMPLE
    def sample2ids(wildcards):
        return expand('{{sample}}/cells/{barcode}_{{sample}}/dedup/dedup.bam',
                   barcode = BCs[wildcards.sample])
    
    rule refine:
    input:
        '{sample}/demultiplex/{sample}_{id}.demultiplex.bam'
    output:
        bam = '{sample}/polyA_trimming/{sample}_{id}.fltnc.bam',
    
    rule split:
    input:
        sample2ids
    output:
        # cannot use a function here, so I create a dummy file to pipe
        'dummy_file.txt'
    
    rule dedup_split:
    input:
        'dummy_file.txt'
    output:
        bam = "{sample}/cells/{barcode}_{sample}/dedup/dedup.bam",
    
    
    rule merge:
    input:
        sample2bc