python bioinformatics snakemake

Snakemake expand using dictionary values


I have a dictionary with patient IDs as keys and, for each patient, a list of sample names as values (each sample has one fastq file).

patient_samples = {
  "patientA": ["sample1", "sample2", "sample3"],
  "patientB": ["sample1", "sample4", "sample5", "sample6"]
}

I want to align each sample.fastq and output the aligned .bam file in a directory for each patient. The resulting directory structure I want is this:

├── patientA
│   ├── sample1.bam
│   ├── sample2.bam
│   ├── sample3.bam
├── patientB
│   ├── sample1.bam
│   ├── sample4.bam
│   ├── sample5.bam
│   ├── sample6.bam

Here I used lambda wildcards to get the samples for each patient using the "patient_samples" dictionary.

rule align:
    input:
        lambda wildcards: [
            "{0}.fastq".format(sample_id)
            for sample_id in patient_samples[wildcards.patient_id]
        ]
    output:
        "{patient_id}/{sample_id}.bam"
    shell:
        ### Alignment command

How can I write the rule all to reflect that only certain samples are aligned for each patient? I have tried referencing the dictionary key to specify the samples:

rule all:
    input:
        expand("{patient_id}/{sample_id}.bam", patient_id=patient_samples.keys(), sample_id=patient_samples[patient_id])

However, this leads to a NameError: name 'patient_id' is not defined

Is there another way to do this?


Solution

  • The error occurs because the arguments to expand are evaluated as ordinary Python before expand is called, so no patient_id variable exists yet when patient_samples[patient_id] is looked up:

    expand(
       "{patient_id}/{sample_id}.bam",
       patient_id=patient_samples.keys(),
       sample_id=patient_samples[patient_id])
                                    ^^^^^ Unknown
    
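The same failure can be reproduced in plain Python, without Snakemake. The sketch below uses a hypothetical stand-in named fake_expand (it is not Snakemake's expand) only to show that the NameError is raised while the arguments are being evaluated, before the function body ever runs:

```python
patient_samples = {
    "patientA": ["sample1", "sample2", "sample3"],
    "patientB": ["sample1", "sample4", "sample5", "sample6"],
}

calls = []  # records whether the stand-in function was ever entered


def fake_expand(pattern, **wildcards):
    # Hypothetical stand-in for expand(); it only records that it was called.
    calls.append(pattern)
    return []


error_message = None
try:
    # The patient_id inside the brackets is a plain Python name, not the
    # keyword argument on the line above, so the lookup fails first.
    fake_expand(
        "{patient_id}/{sample_id}.bam",
        patient_id=patient_samples.keys(),
        sample_id=patient_samples[patient_id],
    )
except NameError as err:
    error_message = str(err)

print(error_message)
print("fake_expand was called:", bool(calls))
```

Because the exception is raised before the call, no keyword-argument trick inside a single expand call can make one argument depend on another.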

    Using expand is convenient when you already have lists of wildcard values. In more complex cases, such as a different sample list for each patient, it is simpler to build the list in plain Python:

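    # One path per (patient, sample) pair; each dictionary value is a list,
    # so a second loop is needed to iterate over its samples.
    list_inputs_all = [
       f"{patient_id}/{sample_id}.bam"
       for patient_id, sample_ids in patient_samples.items()
       for sample_id in sample_ids
    ]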
       
    rule all:
        input:
            list_inputs_all
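As a quick check outside Snakemake, the target list can be built and printed in plain Python. This is a minimal sketch: the helper name bam_targets is an assumption made for illustration, and the dictionary is the one from the question. It pairs each patient only with that patient's own samples, so sample1 appears under both patients while sample4 appears only under patientB.

```python
patient_samples = {
    "patientA": ["sample1", "sample2", "sample3"],
    "patientB": ["sample1", "sample4", "sample5", "sample6"],
}


def bam_targets(samples_by_patient):
    """Return one '<patient>/<sample>.bam' path per (patient, sample) pair."""
    return [
        f"{patient_id}/{sample_id}.bam"
        for patient_id, sample_ids in samples_by_patient.items()
        for sample_id in sample_ids
    ]


targets = bam_targets(patient_samples)
for path in targets:
    print(path)
```

A plain expand over both lists would instead produce every patient/sample combination, including paths such as patientA/sample4.bam that should not exist.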