Search code examples
pythonbashshellbioinformaticssnakemake

Using cutadapt adapter file inputs in Snakemake


I am new to Snakemake, trying to port my BASH code to work with Snakemake rules. Disclaimer out of the way, I am attempting to demultiplex a fastq file with cutadapt in a shell block, using a fasta file with headers and adapter sequences. Samplenames matching the fasta headers are loaded in and specified as a wild card in rule all:

rule all:
input:
    expand([os.path.join(outDir, "reads_trim.fq"),
            os.path.join(outDir, "{name}.tmp")], name=samples_list)

The reads_trim file is a standard, concatenated fastq file (from Nanopore sequencing). My problem is that this rule is run 12 times, since I have 12 samples in my "samples_list". I then get 12 output files, as I should, with correct naming, but they all contain the exact same. I wonder if output would be correct if it would run just once instead, but haven't found a way to force that behaviour.

rule demultiplex:
""" Demultiplex Nextera-styled barcodes using the compiled barcode fasta file """
input:
    barcodeFile=os.path.join(outDir, "barcodes_used.fasta"),
    fastqFile=os.path.join(outDir, "reads_trim.fq")
output: os.path.join(outDir, "{name}.tmp")
message: "Demultiplexing trimmed reads"
log: "logs/cutadapt/demultiplexing_{name}.log"
shell:
    """
    cutadapt -e 0.20 -O 10 -m 15 -M 15 -g file:{input.barcodeFile} -o {output} {input.fastqFile}
    """

Any idea how to get the standard behaviour where one file goes in, x files come out with different output (where the adapter was found)?


Solution

  • I'm not familiar with cutadapt for demultiplexing but from a quick look at the docs, I think you want something along these lines (I simplified your original code):

    rule all:
        input:
            expand("{name}.tmp", name= samples_list),
    
    rule demultiplex:
        input:
            barcodeFile= "barcodes_used.fasta",
            fastqFile= "reads_trim.fq"
        output: 
            expand("{name}.tmp", name=samples_list),
        shell:
            r"""
            cutadapt ... -g file:{input.barcodeFile} -o {{name}}.tmp {input.fastqFile}
            """
    

    If the input reads_trim.fq produces all the 12 output files in one shot, then you need to list all the 12 outputs in the output directive. I used the expand() function for that. Note that {{name}}.tmp has double curly braces to tell snakemake that it's not a wildcard but it's to be read exactly as {name} (as per cutadapt docs if I read them right).

    By the way, are you sure that cutadapt can demultiplex Nanopore?