Search code examples
pythonshared-memorysnakemake

Using STAR shared memory module in Snakemake for sequential alignment tasks


I'm writing a Snakemake pipeline for scRNAseq sequence processing which uses STAR as the alignment tool.

Loading genome index into the memory for each alignment job is very time-consuming. Since I have a lot of memory at my disposal, I figure that it is feasible to use the "shared memory" module from STAR. With it I can load the genome into the memory and keep it there until all the jobs are done.

With shell this is very easy to achieve, for example here.

But chianing it in a Snakemake pipeline seems a non-trivial task, especially that the STAR "shared memory" module doesn't require any input or output anything. For example, I tried:

# Not full pipeline
rule umi_tools_extract:
    input:
        read1="{sample}_R1.fq.gz",
        read2="{sample}_R2.fq.gz",
        whitelist="whitelist.txt"
    output:
        "{sample}_R1_extracted.fq.gz"
    shell:
        """
        umi_tools extract --bc-pattern=CCCCNNNN \
                          --stdin {input.read1} \
                          --read2-in {input.read2} \
                          --stdout {output} \
                          --read2-stdout --filter-cell-barcode --error-correct-cell \
                          --whitelist={input.whitelist}
        """

rule STAR:
    input:
        fq="{sample}_R1_extracted.fq.gz",
        genomeDir=directory("/path/to/genome")
    output:
        "{sample}_Aligned.sortedByCoord.out.bam"
    threads:32
    shell:
        """
        STAR --runThreadN {threads} \
             --genomeLoad LoadAndKeep \
             --genomeDir {input.genomeDir} \
             --readFilesIn {input.fq} \
             --readFilesCommand zcat \
             --limitBAMsortRAM 20000000000 \
             --outFilterMultimapNmax 1 \
             --outFilterType BySJout \
             --outSAMstrandField intronMotif \
             --outFilterIntronMotifs RemoveNoncanonical \
             --outFilterMismatchNmax 6 \
             --outSAMtype BAM SortedByCoordinate \
             --outFileNamePrefix {wildcards.sample}_
        """
... ...

rule STAR_unload:
    input:
        genomeDir=dirctory("/path/to/genome")
#    output:  ## Put something here??
#        ""
    shell:
        """
        STAR --genomeLoad Remove \
             --genomeDir {input.genomeDir} \
        """

But obviously this wouldn't work. I'm thinking of putting a dummy output in the STAR_unload rule and feed it into rules downstream of STAR so that the STAR_unload will be triggered after alignment jobs are done.

Any help would be appreciated!


Solution

  • the STAR "shared memory" module doesn't require any input or output anything

    I'm not familiar with STAR and the shared memory option, but the issue about input/output can be easily resolved using dummy, flag files that signal a step is done. E.g.:

    rule index_genome:
        input:
            fa= 'genome.fa',
        output:
            done= touch('index.done'),
        shell:
            r"""
            STAR index ... {input.fa} 
            """
    
    rule load_genome:
        input:
            'index.done',
        output:
            touch('loading.done'),
        shell:
            r"""
            STAR --genomeLoad LoadAndExit --genomeDir ...
            """
    
    rule align:
         input:
             fq= 'reads.fastq',
             idx= 'loading.done',
         output:
             ...
         shell:
             r"""
             STAR ...
             """
    
    rule unload_genome:
        # Delete the loading.done flag file otherwise subsequent runs of the pipeline 
        # will fail to load the genome again if STAR alignment is needed.
        input:
            # Generic function that aggregates all alignment outputs
            bams= get_files(),
            idx= 'loading.done',
        output:
            "logs/STARunload_Log.out"
        shell:
            r"""
            STAR --genomeLoad Remove \
                 --genomeDir {input.genomeDir} \
                 --outFileNamePrefix logs/STARunload_
            rm {input.idx}
            """"