Tags: recursion, curl, hyperlink, download, wget

How can I use wget to download a list of links from NOAA NCEI?


I need to download public bathymetry data from NOAA NCEI. Often this means downloading hundreds of small files that are later stitched together. NOAA NCEI has a tool for this -- "request files": you click request, enter your email, and wait for them to send you a zipped folder of all the files. This can take more than a week, and sometimes requests fail without your knowledge. I would like to avoid that method. Below is an example data source:

https://www.ngdc.noaa.gov/ships/nautilus/NA072_mb.html

Ultimately, I would like to use wget/curl to download every .gz file from an NCEI page such as that one. I noticed that all of the file links are present on the page: you can right-click a link and open it in a new window to download a single file immediately. If you do this, it redirects to a link like this:

http://data.ngdc.noaa.gov/platforms/ocean/ships/nautilus/NA072/multibeam/data/version2/MB/em302/0000_20160601_180321_Nautilus_EM302.gsf.mb121.gz

How can I use a command line tool like wget to download all .gz files from a page like this?

I have tried commands such as:

wget --execute="robots = off" -A.gz --mirror --convert-links --no-parent [url]

but I get one of two errors for the .gz files:

"Unable to establish SSL connection."

or

"HTTP request sent, awaiting response... 301 Moved Permanently"


Solution

  • You are accessing https://www.ngdc.noaa.gov/ships/nautilus/NA072_mb.html, but the file links point to a different host, e.g. http://data.ngdc.noaa.gov/platforms/ocean/ships/nautilus/NA072/multibeam/data/version2/MB/em302/0000_20160601_180321_Nautilus_EM302.gsf.mb121.gz. By default wget does not crawl to other hosts, so you need to combine --recursive with -H; consult the Spanning Hosts section of the wget manual for more information.

    I suggest using the following command:

    wget --recursive --level=1 -H --accept gz 'https://www.ngdc.noaa.gov/ships/nautilus/NA072_mb.html'
    

    Beyond the --recursive -H combination mentioned above, I limited the depth (--level) to 1, meaning that I only want links present on the given page (not links on linked pages, etc.), and restricted downloads to .gz files. Please try running that command and report whether it downloads the desired files.
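
    If the recursive crawl still trips over the SSL error or the 301 redirect, a simpler fallback is to extract the .gz links from the page yourself and hand them to wget as a list. This is only a sketch: it assumes the links appear as absolute URLs in the page HTML, and the gz_urls.txt filename is just an example.

    # Fetch the page, extract every absolute link ending in .gz, and keep the unique URLs
    wget -qO- 'https://www.ngdc.noaa.gov/ships/nautilus/NA072_mb.html' \
      | grep -oE 'https?://[^"]+\.gz' \
      | sort -u > gz_urls.txt

    # Download each URL on the list; wget follows the http->https 301 redirects on its own
    wget --input-file=gz_urls.txt --continue

    The --continue flag lets you re-run the second command after an interruption without re-downloading files you already have.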