Search code examples
pythonsnakemake

Snakemake - How to use every line of input file as wildcard


I am pretty new to using Snakemake and I have looked around on SO to see if there is a solution for the below - I am almost very close to a solution, but not there yet.

I have a single column file containing a list of SRA ids and I want to use snakemake to define my rules such that every SRA id from that file becomes a parameter on command line.

#FileName = Samples.txt
Samples
SRR5597645
SRR5597646
SRR5597647

Snakefile below:

from pathlib import Path
shell.executable("bash")
import pandas as pd
import os
import glob
import shutil

configfile: "config.json"

data_dir=os.getcwd()

units_table = pd.read_table("Samples.txt")
samples= list(units_table.Samples.unique())

#print(samples)

rule all:
    input:
           expand("out/{sample}.fastq.gz",sample=samples)

rule clean:
     shell: "rm -rf .snakemake/"

include: 'rules/download_sample.smk'

download_sample.smk

rule download_sample:
    """
    Download RNA-Seq data from SRA.
    """
    input: "{sample}"
    output: expand("out/{sample}.fastq.gz", sample=samples)
    params:
        outdir = "out",
        threads = 16
    priority:85
    shell: "parallel-fastq-dump --sra-id {input} --threads {params.threads} --outdir {params.outdir}  --gzip "

I have tried many different variants of the above code, but somewhere I am getting it wrong.

What I want: For every record in the file Samples.txt, I want the parallel-fastq-dump command to run. Since I have 3 records in Samples.txt, I would like these 3 commands to get executed

parallel-fastq-dump --sra-id SRR5597645 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597646 --threads 16 --outdir out --gzip
parallel-fastq-dump --sra-id SRR5597647 --threads 16 --outdir out --gzip

This is the error I get

snakemake -np
WildcardError in line 1 of rules/download_sample.smk:
Wildcards in input files cannot be determined from output files:
'sample'

Thanks in advance


Solution

  • It seems to me that what you need is to access the sample wildcard using the wildcards object:

    rule all:
        input: expand("out/{sample}_fastq.gz", sample=samples)
    
    rule download_sample:
        output: 
            "out/{sample}_fastq.gz"
        params:
            outdir = "out",
            threads = 16
        priority:85
        shell:"parallel-fastq-dump --sra-id {wildcards.sample} --threads {params.threads} --outdir {params.outdir}  --gzip "