Search code examples
pandassnakemake

Snakemake, pandas, and NCBI: how do I combine a pandas dataframe with the remote NCBI search?


I'm still pretty new to Snakemake, and I've been having trouble with a rule I'm trying to write.

I've been trying to combine using snakemake.remote.NCBI with accessing a pandas dataframe and using a wildcard, but I can't seem to make it work.

I have a tsv file called genomes.tsv with several columns, where each row is one species. One is "id" and has the genbank id for the species's genomes. Another "species" has a short string unique for each species. In my Snakefile, genomes.tsv is imported as genomes, with only the id and species column, then species is set as genomes index and dropped from genome.

I want to use the values in "species" as values for the wildcard {species} in my workflow, and I want my rule to use snakemake.remote.NCBI to download each species's genome sequence in fasta format and then output it to a file "{species}_gen.fa"

from snakemake.remote.NCBI import RemoteProvider as NCBIRemoteProvider
import pandas as pd

configfile: "config.yaml"

NCBI = NCBIRemoteProvider(email=config["email"]) # email required by NCBI to prevent abuse

genomes = pd.read_table(config["genomes"], usecols=["species","id"]).set_index("species")

SPECIES = genomes.index.values.tolist()

rule all:
    input: expand("{species}_gen.fasta",species=SPECIES)

rule download_and_count:
    input:
        lambda wildcards: NCBI.remote(str(genomes[str(wildcards.species)]) + ".fasta", db="nuccore")
    output:
        "{species}_gen.fasta"
    shell:
        "{input} > {output}"

Currently, trying to run my code results in a key error, but it says that the key is a value from species, so it should be able to get the corresponding genbank id from genomes.

EDIT: here is the error

InputFunctionException in line 18 of /home/sjenkins/work/olflo/Snakefile:
KeyError: 'cappil'
Wildcards:
species=cappil

cappil is a valid value for {species}, and it should be usable as an index, I think. Here are the first few rows of genomes, for reference:

species id  accession   name    assembly
cappil  8252558 GCA_004027915.1 Capromys_pilorides_(Desmarest's_hutia)  CapPil_v1_BIUU
cavape  1067048 GCA_000688575.1 Cavia_aperea_(Brazilian_guinea_pig) CavAp1.0
cavpor  175118  GCA_000151735.1 Cavia_porcellus_(domestic_guinea_pig)   Cavpor3.0

Update:

I tried changing the the input line to:

lambda wildcards: NCBI.remote(str(genomes[genomes['species'] == wildcards.species].iloc[0]['id']) + ".fasta", db="nuccore")

but that gives me the error message:

Traceback (most recent call last): File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/init.py", line 547, in snakemake export_cwl=export_cwl) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/workflow.py", line 421, in execute dag.init() File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 122, in init job = self.update([job], progress=progress) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 603, in update progress=progress) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 666, in update_ progress=progress) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 603, in update progress=progress) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/dag.py", line 655, in update_ missing_input = job.missing_input File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/jobs.py", line 398, in missing_input for f in self.input File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/jobs.py", line 399, in if not f.exists and not f in self.subworkflow_input) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 208, in exists return self.exists_remote File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 119, in wrapper v = func(self, *args, **kwargs) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/io.py", line 258, in exists_remote return self.remote_object.exists() File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 72, in exists likely_request_options = self._ncbi.guess_db_options_for_extension(self.file_ext, db=self.db, rettype=self.rettype, retmode=self.retmode) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 110, in file_ext accession, version, file_ext = self._ncbi.parse_accession_str(self.local_file()) File "/home/sjenkins/miniconda3/envs/olflo/lib/python3.7/site-packages/snakemake/remote/NCBI.py", line 366, in parse_accession_str assert file_ext, "file_ext must be defined: {}.{}.. Possible values include: {}".format(accession,version,", ".join(list(self.valid_extensions))) AssertionError: file_ext must be defined: ... Possible values include: est, ssexemplar, gb.xml, docset, fasta.xml, fasta, fasta_cds_na, abstract, txt, gp, medline, chr, flt, homologene, alignmentscores, gbwithparts, seqid, fasta_cds_aa, gpc, uilist, uilist.xml, rsr, xml, gb, gene_table, gss, ft, gp.xml, acc, asn1, gbc


Solution

  • Ok, so it turns out that the reason that I was getting AssertionError: file_ext must be defined: is that NCBIRemoteProvider can't recognize the file extension if the file name that its given doesn't have a valid Genbank accession number. I was giving it file names with genbank ids, so it returned that error.

    Also, it seems like the whole genome sequences don't have an accession number that returns all the sequences. Instead there's an accession number for the wgs report and then accession numbers for each scaffold. I decided to try downloading the genomes I need manually instead of trying to download all the scaffolds and then combine them.