How to handle FTP links provided in a config file in Snakemake?


I am trying to build a Snakemake workflow that symlinks a local file if it exists and otherwise downloads the file and integrates it into the workflow. To do this, I use two rules with the same output and give preference to the linking rule (ln_fastq_pe below) using ruleorder.

Whether each file exists is known before the workflow runs. The file paths or FTP links are provided in a tab-delimited config file that the workflow reads to load the samples, e.g. the contents of samples.tsv:

id      sample_name     fq1     fq2
b       test_paired     resources/SRR1945436_1.fastq.gz resources/SRR1945436_2.fastq.gz
c       test_paired2    ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR194/005/SRR1945435/SRR1945435_1.fastq.gz  ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR194/005/SRR1945435/SRR1945435_2.fastq.gz

The relevant code from the workflow:

import pandas as pd
from snakemake.remote.FTP import RemoteProvider as FTPRemoteProvider
FTP = FTPRemoteProvider()

configfile: "config/config.yaml"
samples = pd.read_table("config/samples.tsv").set_index("id", drop=False)
all_ids=list(samples["id"])

ruleorder: ln_fastq_pe > dl_fastq_pe
rule dl_fastq_pe:
    """
    download file from ftp link
    """
    input:
        fq1=lambda wildcards: FTP.remote(samples.loc[wildcards.id, "fq1"], keep_local=True),
        fq2=lambda wildcards: FTP.remote(samples.loc[wildcards.id, "fq2"], keep_local=True)
    output:
        "resources/fq/{id}_1.fq.gz",
        "resources/fq/{id}_2.fq.gz"
    shell:
        """
        mv {input.fq1} {output[0]}
        mv {input.fq2} {output[1]}
        """

rule ln_fastq_pe:
    """
    link file
    """
    input:
        fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"],
        fq2=lambda wildcards: samples.loc[wildcards.id, "fq2"]
    output:
        "resources/fq/{id}_1.fq.gz",
        "resources/fq/{id}_2.fq.gz"
    shell:
        """
        ln -sr {input.fq1} {output[0]}
        ln -sr {input.fq2} {output[1]}
        """

When I run this workflow, I receive the following error, which points to the line defining the ln_fastq_pe rule.

WorkflowError in line 58 of /path/to/Snakefile:
Function did not return str or list of str.

I think the error comes from how the dl_fastq_pe rule handles the FTP links given in the samples.tsv config file. What is the proper way to describe FTP links in a tabular config file so that Snakemake understands them and can download and use the files in a workflow?

Also, is what I am trying to do possible, and will this method get me there? I have tried other approaches (e.g. using Python code to check whether the file exists, then running one set of shell commands if it does and another if it doesn't), to no avail.


Solution

  • I got this working by merging the two rules into one, omitting input, and reading the fields from samples.tsv through params instead. Unlike input, params values are not treated as files, so Snakemake accepts either a local path or an FTP link there. The shell block then uses the test command to check whether the file exists: if it does, it creates the symlink; if not, it downloads the file with wget.

    The solution is as follows:

    import pandas as pd
    
    samples = pd.read_table("config/samples.tsv").set_index("id", drop=False)
    all_ids=list(samples["id"])
    
    rule all:
        input:
            expand("resources/fq/{id}_1.fq.gz", id=all_ids),
            expand("resources/fq/{id}_2.fq.gz", id=all_ids)
    
    rule dl_fastq_pe:
        """
        If the fq1 entry is an existing local file, symlink both files.
        Otherwise, treat both entries as links and download them to resources.
        """
        params:
            fq1=lambda wildcards: samples.loc[wildcards.id,"fq1"],
            fq2=lambda wildcards: samples.loc[wildcards.id,"fq2"]
        output:
            "resources/fq/{id}_1.fq.gz",
            "resources/fq/{id}_2.fq.gz"
        shell:
            """
            if test -f "{params.fq1}"
            then
                ln -sr "{params.fq1}" "{output[0]}"
                ln -sr "{params.fq2}" "{output[1]}"
            else
                wget --no-check-certificate -O "{output[0]}" "{params.fq1}"
                wget --no-check-certificate -O "{output[1]}" "{params.fq2}"
            fi
            """
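
The same link-or-download decision can also be written in plain Python, which makes it possible to check each file on its own rather than only fq1. The following is a minimal sketch, not part of the original answer: the names is_remote, fetch_or_link and REMOTE_SCHEMES are hypothetical, and it assumes a POSIX system where symlinks are available and that urllib can reach the FTP server.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse

# Hypothetical helper names; the set of schemes treated as remote is an assumption.
REMOTE_SCHEMES = {"ftp", "http", "https"}


def is_remote(source):
    """Return True if the table entry looks like a downloadable URL."""
    return urlparse(source).scheme.lower() in REMOTE_SCHEMES


def fetch_or_link(source, dest):
    """Download a remote source to dest, or symlink a local one.

    Returns "download" or "link" so the caller can log what happened.
    """
    os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)

    if is_remote(source):
        # urllib.request.urlopen handles ftp:// as well as http(s):// URLs.
        with urllib.request.urlopen(source) as response, open(dest, "wb") as handle:
            shutil.copyfileobj(response, handle)
        return "download"

    if not os.path.isfile(source):
        raise FileNotFoundError(f"local source does not exist: {source}")

    # Build a relative link target, similar in spirit to `ln -sr`.
    dest_dir = os.path.dirname(os.path.abspath(dest))
    target = os.path.relpath(os.path.abspath(source), dest_dir)
    if os.path.lexists(dest):
        os.remove(dest)
    os.symlink(target, dest)
    return "link"
```

In a Snakefile, one way to use this would be to replace the shell block with a run block that calls fetch_or_link(params.fq1, output[0]) and fetch_or_link(params.fq2, output[1]). A missing local file then raises an error instead of being passed to wget as if it were a link.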