Tags: r, ftp, windows-10, bioinformatics, firewall

Unable to download files using FTP from R


I posted yesterday about what I thought was a problem with my code. Today I found out that, oddly, the code itself is not the problem.

I've been trying to download genomic data from NCBI databases using the biomartr package:

install.packages("devtools")
library(devtools)
devtools::install_github("hadley/devtools")
devtools::install_github("HajkD/biomartr", build_vignettes = TRUE, dependencies = TRUE)
library(biomartr)
options(timeout = 300000)
is.genome.available(db = "refseq", organism = "Homo sapiens")
MtbCDC1551 <- getGenome(db = "refseq", 
                        organism = "GCF_000008585.1",  
                        path   = "~/Rdir/_ncbi_downloads/genomes",
                        reference = FALSE)

What I'm downloading in this case is largely irrelevant, since the plan is to pass in whatever organism or accession I need. Right now, though, I'm running into a problem:

> MtbCDC1551 <- getGenome(db = "refseq", 
+                         organism = "GCF_000008585.1",  
+                         path   = "~/Rdir/_ncbi_downloads/genomes",
+                         reference = FALSE)
Starting genome retrieval of 'GCF_000008585.1' from refseq ...


The download session seems to have timed out at the FTP site 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/585/GCF_000008585.1_ASM858v1/GCF_000008585.1_ASM858v1_genomic.fna.gz'. This could be due to an overload of queries to the databases. Please restart this function to continue the data retrieval process or wait for a while before restarting this function in case your IP address was logged due to an query overload on the server side.
Error: Please provide a valid file path to your genome assembly file. 

I've tried this on a Mac, and it works. On my own PC, which runs Windows 10, the error persists. On the Mac I was running R 4.2.0; on the Windows laptop I pasted this code straight from the manual and tried both versions, to no avail. What settings do I need to change on Windows to allow this? I've given the R executable and RStudio every firewall permission I can think of. What am I missing?


Solution

  • It sounds like this is an ongoing issue for bacterial genomes (possibly because the download link is not generated correctly?). I use the following to fetch genomic data for B. burgdorferi. It is not the ideal single-function fetch, but it gives you the same information.

    library(biomartr)
    library(Biostrings)
    
    # find your desired genome info
    NCBIsearch <- is.genome.available(db = "refseq", organism = "Borreliella burgdorferi B31", details = TRUE)
    
    # copy the link to the assembly's NCBI download directory to the clipboard (requires the clipr package)
    clipr::write_clip(NCBIsearch$ftp_path) 
    
    # open that link in a browser, then copy the URLs for "genomic.gff.gz", "cds_from_genomic.fna.gz", and "genomic.fna.gz" and paste them below
    download.file('https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2/GCF_000008685.2_ASM868v2_genomic.gff.gz', 
                  'B31.gff.gz')
    download.file('https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2/GCF_000008685.2_ASM868v2_cds_from_genomic.fna.gz', 
                  'B31.cds.gz')   
    download.file('https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2/GCF_000008685.2_ASM868v2_genomic.fna.gz', 
                  'B31.fna.gz')
    
    # read in the downloaded files
    B31_GFF <- read_gff(file = "B31.gff.gz")
    B31_CDS <- read_cds(file = "B31.cds.gz")
    B31_FNA <- read_genome(file = "B31.fna.gz")
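
The copy-and-paste step above could probably be scripted. This is a minimal sketch, assuming that NCBI names each file after its assembly directory (as the three URLs above do) and that `ftp_path` has the same form as the value returned in `NCBIsearch$ftp_path`. `build_ncbi_urls` is a hypothetical helper written for this example; it is not part of biomartr.

```r
# Hypothetical helper (not a biomartr function): build download URLs from an
# NCBI assembly directory path such as NCBIsearch$ftp_path.
build_ncbi_urls <- function(ftp_path,
                            suffixes = c("genomic.gff.gz",
                                         "cds_from_genomic.fna.gz",
                                         "genomic.fna.gz")) {
  # Drop any trailing slash, then switch ftp:// to https://
  base <- sub("/+$", "", ftp_path)
  base <- sub("^ftp://", "https://", base)
  # Assumption: the directory name doubles as the file-name prefix
  prefix <- basename(base)
  urls <- paste0(base, "/", prefix, "_", suffixes)
  names(urls) <- suffixes
  urls
}

# Example path, taken from the B31 assembly used above
ftp_path <- "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/008/685/GCF_000008685.2_ASM868v2"
urls <- build_ncbi_urls(ftp_path)
print(urls)

# Uncomment to download; mode = "wb" requests a binary transfer on Windows
# for (s in names(urls)) {
#   download.file(urls[[s]], destfile = paste0("B31_", s), mode = "wb")
# }
```

The download loop is left commented out so that the sketch runs without network access. If the direct `download.file()` calls also fail on the Windows machine, that would suggest the problem lies in the network setup and not in biomartr.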