I've implemented a working Snakemake workflow and now want to add the ability to merge specific samples, whether they are technical or biological replicates. I was told I should use multiple wildcards and work with a proper data frame for this, but I'm unsure what the syntax would even look like and how to differentiate between entries properly.
My samples.tsv used to look like this:
| sample | fq1 | fq2 |
|---|---|---|
| bCalAnn1_1_1 | path/to/file | path/to/file |
| bCalAnn1_1_2 | path/to/file | path/to/file |
| bCalAnn2_1_1 | path/to/file | path/to/file |
| bCalAnn2_1_2 | path/to/file | path/to/file |
| bCalAnn2_2_1 | path/to/file | path/to/file |
| bCalAnn2_3_1 | path/to/file | path/to/file |
Now it looks like this:
| sample | unit_bio | unit_tech | fq1 | fq2 |
|---|---|---|---|---|
| bCalAnn1 | 1 | 1 | path/to/file | path/to/file |
| bCalAnn1 | 1 | 2 | path/to/file | path/to/file |
| bCalAnn2 | 1 | 1 | path/to/file | path/to/file |
| bCalAnn2 | 1 | 2 | path/to/file | path/to/file |
| bCalAnn2 | 2 | 1 | path/to/file | path/to/file |
| bCalAnn2 | 3 | 1 | path/to/file | path/to/file |
Here, unit_bio is the index of the library for a sample and unit_tech is the index of the lane within that library.
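As a point of reference, this is a minimal sketch of reading and querying such a sheet with plain pandas (the filename and values are placeholders; `dtype=str` keeps the unit columns comparable to Snakemake wildcard values, which are always strings):

```python
import pandas as pd

# Read the sheet without an index; dtype=str keeps unit_bio/unit_tech
# as strings so they compare cleanly against wildcard values later.
samples = pd.read_table("samples.tsv", dtype=str)

# Pick out the single row for one (sample, unit_bio, unit_tech) combination.
row = samples[
    (samples["sample"] == "bCalAnn1")
    & (samples.unit_bio == "1")
    & (samples.unit_tech == "2")
]
print(row.fq1.iloc[0])  # -> path/to/file
```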
My Snakefile looks like this:
```python
import pandas as pd
import os
import yaml

configfile: "config.yaml"

samples = pd.read_table(config["samples"], index_col="sample")

rule all:
    input:
        expand(config["arima_mapping"] + "{sample}_{unit_bio}_{unit_tech}_R1.bam",
               sample=samples.sample, unit_bio=samples.unit_bio, unit_tech=samples.unit_tech)

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples.sample == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        ref = config["PacBio_assembly"],
        linker = config["index"] + "pac_bio_assembly.fna.amb"
    output:
        config["arima_mapping"] + "unprocessed_bam/({sample},{unit_bio},{unit_tech})_R1.bam"
    params:
    conda:
        "../envs/arima_mapping.yaml"
    log:
        config["logs"] + "arima_mapping/mapping/({sample},{unit_bio},{unit_tech})_R1.log"
    threads:
        12
    shell:
        "bwa mem -t {threads} {input.ref} {input.read1} | samtools view --threads {threads} -h -b -o {output} 2> {log}"
```
`r1_mapping` is basically my first rule, which requires no differentiating between replicates, so I want to run it for every row. I experimented a bit with the syntax but ultimately ran into this error:
```
MissingInputException in rule r1_mapping in line 20 of /home/path/to/assembly_downstream/workflow/rules/arima_mapping.smk:
Missing input files for rule r1_mapping:
    output: /home/path/to/assembly_downstream/intermed/arima_mapping/unprocessed_bam/('bCalAnn1', 1, 1)_R1.bam
    wildcards: sample='bCalAnn1', unit_bio= 1, unit_tech= 1
    affected files:
        /home/path/to/assembly_downstream/data/arima_HiC/('bCalAnn1', 1, 1)_R1.fastq.gz
```
Am I even reading the table properly? I'm quite confused at this point; can anyone give me a hint on where to start fixing this?
Fixed by changing

```python
samples = pd.read_table(config["samples"], index_col=["sample","unit_bio","unit_tech"])
```

to

```python
samples = pd.read_table(config["samples"], index_col="sample")
```
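For context on why the multi-index produced those odd paths: with `index_col=["sample","unit_bio","unit_tech"]`, each index entry is a tuple, and formatting a tuple into a filename yields exactly the `('bCalAnn1', 1, 1)` fragment from the error. A minimal, self-contained demonstration with made-up values:

```python
import pandas as pd

# Hypothetical one-row sheet, indexed by three columns.
df = pd.DataFrame(
    {"sample": ["bCalAnn1"], "unit_bio": [1], "unit_tech": [1], "fq1": ["path/to/file"]}
).set_index(["sample", "unit_bio", "unit_tech"])

# Each multi-index entry is a tuple, so string-formatting it
# reproduces the "('bCalAnn1', 1, 1)" seen in the error paths.
print(f"{df.index[0]}_R1.bam")  # -> ('bCalAnn1', 1, 1)_R1.bam
```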
EDIT: New error
```
Missing input files for rule all:
    affected files:
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_2_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ... fq2
sample ...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn1 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 1 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 2 ... /home/path/to/assembly_downstream/data/ari...
bCalAnn2 3 ... /home/path/to/assembly_downstream/data/ari...
[6 rows x 4 columns]>_3_1_R1.bam
        /home/path/to/assembly_downstream/intermed/arima_mapping/<bound method NDFrame.sample of unit_bio ...
```
This goes on for a total of six times, which sounds like all rows are currently being read six times each. I guess this has something to do with how `expand()` works? I'm going to keep researching for now.
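For what it's worth, two separate things appear to be going on in that log. First, because the table was read with `index_col="sample"`, there is no longer a `sample` column, so `samples.sample` resolves to pandas' built-in `DataFrame.sample` method; its repr is the `<bound method NDFrame.sample of ...>` text leaking into the paths. Second, `expand()` by default builds the cross product of all supplied value lists rather than pairing them row-wise; passing `zip` as the second argument pairs them instead. A small illustration (Snakemake's `expand` can be imported outside a Snakefile as shown):

```python
from snakemake.io import expand

# Default behaviour: cross product of the value lists.
print(expand("{s}_{u}.bam", s=["a", "b"], u=[1, 2]))
# ['a_1.bam', 'a_2.bam', 'b_1.bam', 'b_2.bam']

# With zip, values are paired row-wise: one output per sheet row.
print(expand("{s}_{u}.bam", zip, s=["a", "b"], u=[1, 2]))
# ['a_1.bam', 'b_2.bam']
```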
This is what I would do, slightly simplified to keep things shorter - not tested, there may be errors!
```python
import re
import pandas as pd

samples = pd.read_table('samples.tsv', dtype=str)

wildcard_constraints:
    # Constraints maybe not needed but in my opinion better to set them
    sample='|'.join([re.escape(x) for x in samples['sample']]),
    unit_bio='|'.join([re.escape(x) for x in samples.unit_bio]),
    unit_tech='|'.join([re.escape(x) for x in samples.unit_tech]),

rule all:
    input:
        # Assume for now you just want to produce these bam files,
        # one per row of the sample sheet:
        expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
               sample=samples['sample'], unit_bio=samples.unit_bio, unit_tech=samples.unit_tech),

rule r1_mapping:
    input:
        read1 = lambda wc:
            samples[(samples['sample'] == wc.sample) & (samples.unit_bio == wc.unit_bio) & (samples.unit_tech == wc.unit_tech)].fq1.iloc[0],
        # Other inputs/params that don't need wildcards...
    output:
        "scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam",
```
The main point is to use an input function (a lambda here) to query the sample sheet and get the fastq file corresponding to a given sample, unit_bio, and unit_tech.
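For the merging step itself (your original goal), the same input-function idea extends naturally: a function can return the list of all bams belonging to one sample. This is only a hedged sketch, not part of the code above; the rule name, the merged/ directory, and the use of samtools merge are my assumptions:

```python
def sample_bams(wc):
    # All rows of the sheet belonging to this sample:
    # one mapped bam per (unit_bio, unit_tech) combination.
    rows = samples[samples['sample'] == wc.sample]
    return expand("scaffolds/{sample}_{unit_bio}_{unit_tech}_R1.bam", zip,
                  sample=rows['sample'], unit_bio=rows.unit_bio, unit_tech=rows.unit_tech)

rule merge_sample:  # hypothetical rule name
    input:
        sample_bams
    output:
        "merged/{sample}_R1.bam"
    shell:
        "samtools merge {output} {input}"
```

Requesting merged/bCalAnn2_R1.bam would then pull in and merge the four bCalAnn2 library/lane bams in one step.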
If this doesn't work or you can't make sense of it, post a fully reproducible example that we can easily replicate.