I am trying to download FASTQ files from an FTP server using Snakemake, which I will then post-process. The filenames are in the columns "read1" and "read2" of data.tsv. When I run the code below, I get the following error:
ValueError in line 17 ...
This IOFile is specified as a function and may not be used directly.
Line 17 refers to shell. I tried googling around and the lambda function looks correct; lambda functions are also accepted in params.
Here's my code:
import pandas as pd
samples = pd.read_table("data.tsv").set_index("sample", drop=False)

rule all:
    input:
        lambda wildcards: samples.to_dict()["read1"][wildcards.sample].split('/')[-1],
        lambda wildcards: samples.to_dict()["read2"][wildcards.sample].split('/')[-1]

rule dl:
    output:
        temp(lambda wildcards: samples.to_dict()["read1"][wildcards.sample].split('/')[-1]),
        temp(lambda wildcards: samples.to_dict()["read2"][wildcards.sample].split('/')[-1])
    params:
        read1 = lambda wildcards: samples.to_dict()["read1"][wildcards.sample],
        read2 = lambda wildcards: samples.to_dict()["read2"][wildcards.sample]
    shell:
        "wget {params.read1}; wget {params.read2}"
Please help - I can't figure out what's wrong.
EDIT 1
In case it's helpful, the following code using remote files works (also suggested by euronion below):
import pandas as pd
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()
samples = pd.read_table("data.tsv").set_index("sample", drop=False)

rule all:
    input:
        expand("results/{sample}.sam", sample = samples["sample"])

rule bwa:
    input:
        v = "data/ref.fna",
        read1 = lambda wildcards: FTP.remote(samples.loc[wildcards.sample, 'read1']),
        read2 = lambda wildcards: FTP.remote(samples.loc[wildcards.sample, 'read2'])
    output:
        "results/{sample}.sam"
    shell:
        "scripts/bwa-mem2-2.2.1_x64-linux/bwa-mem2 mem {input.v} {input.read1} {input.read2} > {output}"
EDIT 2
The issue with my original attempt is that Snakemake doesn't allow lambda functions in output. The following minimal example:
read1 = {'s1': 'test1/ERR7671976_1.fastq.gz'}
read2 = {'s1': 'test1/ERR7671976_2.fastq.gz'}

rule all:
    input:
        lambda wildcards: read1[wildcards.sample],
        lambda wildcards: read2[wildcards.sample]

rule test:
    output:
        lambda wildcards: read1[wildcards.sample],
        lambda wildcards: read2[wildcards.sample]
    params:
        r1 = lambda wildcards: read1[wildcards.sample],
        r2 = lambda wildcards: read2[wildcards.sample]
    shell:
        """
        touch {params.r1}
        touch {params.r2}
        """
fails with "SyntaxError: Only input files can be specified as functions", while the following (with user-defined output filenames):
read1 = {'s1': 'test1/ERR7671976_1.fastq.gz'}
read2 = {'s1': 'test1/ERR7671976_2.fastq.gz'}

rule all:
    input:
        expand("{sample}_1.fastq.gz", sample=read1.keys()),
        expand("{sample}_2.fastq.gz", sample=read2.keys())

rule test:
    output:
        '{sample}_1.fastq.gz',
        '{sample}_2.fastq.gz'
    params:
        r1 = lambda wildcards: read1[wildcards.sample],
        r2 = lambda wildcards: read2[wildcards.sample]
    shell:
        """
        touch {params.r1}; mv {params.r1} {wildcards.sample}_1.fastq.gz
        touch {params.r2}; mv {params.r2} {wildcards.sample}_2.fastq.gz
        """
works fine.
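If the original file names need to be kept, one possible approach is to invert the lookup: let output be a plain wildcard pattern and resolve the remote path from the requested file name in params. The following is only a sketch under assumptions. The names url_by_name, remote_path and the wildcard fname are hypothetical, SimpleNamespace stands in for the wildcards object Snakemake would pass, and base names are assumed to be unique across samples.

```python
import posixpath
from types import SimpleNamespace

# Sample -> remote path, as in the minimal example above.
read1 = {"s1": "test1/ERR7671976_1.fastq.gz"}
read2 = {"s1": "test1/ERR7671976_2.fastq.gz"}

# Hypothetical reverse lookup: local file name -> remote path.
# Assumes no two remote paths share the same base name.
url_by_name = {
    posixpath.basename(path): path
    for path in (*read1.values(), *read2.values())
}

def remote_path(wildcards):
    """Params-style function: map the requested file name back to its remote path."""
    return url_by_name[wildcards.fname]

# Concrete file names that a target rule could request.
targets = sorted(url_by_name)

# Stand-in for the wildcards object (assumption, for illustration only).
wc = SimpleNamespace(fname="ERR7671976_1.fastq.gz")
print(remote_path(wc))
```

In a Snakefile this would presumably be paired with an output pattern built on the fname wildcard (ideally constrained so it does not match every file), a params entry set to remote_path, and a target rule whose input is the targets list.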
The issue with my original attempt is that Snakemake doesn't allow lambda functions in output (see Edit 2 above).
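A rough way to see why output has to be a plain pattern: Snakemake works backwards from the requested target files and matches them against each rule's output patterns to fill in the wildcards, so at that point there are no wildcard values to call a function with. The sketch below is a simplified illustration of that matching step, not Snakemake's actual implementation. The function names pattern_to_regex and infer_wildcards are hypothetical, and each wildcard is assumed to appear only once per pattern.

```python
import re

def pattern_to_regex(pattern):
    """Turn an output pattern such as '{sample}_1.fastq.gz' into a compiled regex."""
    # With one capture group, re.split alternates literal text and wildcard names.
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)        # literal text
        else:
            regex += f"(?P<{part}>.+)"      # wildcard, matched greedily
    return re.compile(regex)

def infer_wildcards(pattern, target):
    """Return the wildcard values that make `pattern` equal `target`, or None."""
    match = pattern_to_regex(pattern).fullmatch(target)
    return match.groupdict() if match else None

print(infer_wildcards("{sample}_1.fastq.gz", "s1_1.fastq.gz"))  # {'sample': 's1'}
print(infer_wildcards("{sample}_1.fastq.gz", "s1_2.fastq.gz"))  # None
```

Only once the wildcards are known this way can functions in input or params be evaluated, which is consistent with lambdas being accepted there but not in output.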