I am trying to create a Snakemake rule that takes my fastq files as input and returns a .sam file as output for each fastq file.
I have a file like this:
FILE TYPE SM LB ID PU PL
xfgh.fastq.gz Single IND1 IND1 IND1 Platform Illumina
IND2.fastq.gz Single IND2 IND2 IND2 Platform Illumina
zfgv.fastq.gz Single IND3 IND3 IND3 Platform Illumina
IND4_P1.fastq.gz Single IND4 IND4 IND4 Platform Illumina
So I did something like this.
I read the file into a dataframe with pandas:
info_df = pd.read_csv("info_file.txt")
I store the file's SM and ID columns in a list,
and I created my rules:
rule all:
    input:
        sam_file = expand("ALIGNEMENT/{sm}/{id}.sam", sm = info_df["SM"], id = info_df["ID"])

rule alignement:
    input:
        fastq_files = "PATH/TO/{fastq}"
    output:
        sam_file = "ALIGNEMENT/{sm}/{id}.sam"
I know the input and output need to share the same wildcards, but is there a way to take my input from the "FILE" column of my file.txt and write the output to a path like "ALIGNEMENT/{sm}/{id}.sam",
where {sm} and {id} come from the SM and ID columns of my file.txt?
I also want the rule to run once per file.
If anyone can help me, thank you.
I am trying to create snakemake rule that take in input my fastq files and return in output a .sam file for each fastq file.
From the above it seems to me that you want to add zip to the expand function in rule all. With zip, wildcards are paired in the order they appear in your input lists; without it, you get every combination of {id} and {sm}.
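To make the difference concrete, here is a small plain-Python sketch of the two behaviours. It only emulates the pairing logic with itertools.product and zip; it is not Snakemake's own expand implementation, and the three sample values are taken from the table in the question.

```python
from itertools import product

# SM and ID values as they would come out of the info file (first three rows).
sm = ["IND1", "IND2", "IND3"]
ids = ["IND1", "IND2", "IND3"]
pattern = "ALIGNEMENT/{sm}/{id}.sam"

# Without zip: every combination of sm and id (3 x 3 = 9 paths),
# including unwanted ones such as ALIGNEMENT/IND1/IND2.sam.
all_combinations = [pattern.format(sm=s, id=i) for s, i in product(sm, ids)]

# With zip: values are paired position by position (3 paths).
paired = [pattern.format(sm=s, id=i) for s, i in zip(sm, ids)]

print(len(all_combinations))
print(paired)
```

The paired list contains one target per row of the info file, which is what rule all should request.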
Then, to get the input fastq file in rule alignement, you need to query the info dataframe to get the FILE corresponding to a given id. You can do this with a lambda function or write a dedicated function to use as input.
Here's my take on it:
import pandas as pd

info_df = pd.read_csv("info_file.txt", sep='\t')

rule all:
    input:
        expand("ALIGNEMENT/{sm}/{id}.sam", zip, sm = info_df["SM"], id = info_df["ID"])

rule alignement:
    input:
        fastq_files=lambda wc: info_df[info_df['ID'] == wc.id]['FILE'],
    output:
        sam_file = "ALIGNEMENT/{sm}/{id}.sam"
    shell:
        r"""
        echo {input.fastq_files} > {output.sam_file}
        """
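If you prefer a dedicated function over the lambda, a rough sketch could look like the following. To keep it runnable outside Snakemake, it makes a few assumptions that are not in the original: the table is parsed with the standard csv module instead of pandas, INFO_TEXT is a hypothetical in-memory copy of two rows of the info file, get_fastq and fastq_by_id are illustrative names, and a SimpleNamespace stands in for the wildcards object Snakemake passes to input functions.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical in-memory copy of the tab-separated info file (two rows only).
INFO_TEXT = (
    "FILE\tTYPE\tSM\tLB\tID\tPU\tPL\n"
    "xfgh.fastq.gz\tSingle\tIND1\tIND1\tIND1\tPlatform\tIllumina\n"
    "IND2.fastq.gz\tSingle\tIND2\tIND2\tIND2\tPlatform\tIllumina\n"
)

# Build the ID -> FILE lookup once, instead of filtering the table per job.
rows = list(csv.DictReader(io.StringIO(INFO_TEXT), delimiter="\t"))
fastq_by_id = {row["ID"]: row["FILE"] for row in rows}


def get_fastq(wildcards):
    """Return the FASTQ file name registered for the {id} wildcard."""
    try:
        return fastq_by_id[wildcards.id]
    except KeyError:
        # Fail with a clear message if an ID is missing from the info file.
        raise ValueError(f"No FILE entry for ID {wildcards.id!r}") from None


# Stand-in for the wildcards of the target ALIGNEMENT/IND1/IND1.sam.
wc = SimpleNamespace(sm="IND1", id="IND1")
print(get_fastq(wc))
```

In the Snakefile, the function would be passed by name, as in fastq_files = get_fastq, and with pandas the same lookup could be built as dict(zip(info_df["ID"], info_df["FILE"])). Returning a plain string also gives you a natural place to prepend a directory such as the PATH/TO prefix from the question.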