Tags: for-loop, netcdf, python-xarray, gfs, python-siphon

How to use a for loop to download GFS URL data using Siphon


I'm trying to loop over and download a subset of GFS data using the Siphon library. With the code laid out below, I can download one file at a time without problems. How can I download the period January 2020 to December 2022, from forecast hour 003 through 168, without having to request a single file at a time?

from siphon.catalog import TDSCatalog
from siphon.ncss import NCSS
import numpy as np
import ipywidgets as widgets
from datetime import datetime, timedelta
import xarray as xr
from netCDF4 import num2date
import os
import time

# Download GFS subset - Radiation (6 Hour Average) and PBLH

for i in range(6, 168, 6):                        # forecast hours f006, f012, ...
    for day in range(1, 32):
        for month in range(1, 13):
            dir_out = '/home/william/GFS_Siphon/2020{:0>2}'.format(month)
            if not os.path.exists(dir_out):
                os.makedirs(dir_out)
            filename = 'gfs.0p25.2020{:0>2}{:0>2}00.f{:0>3}'.format(month, day, i)
            out_path = dir_out + '/' + filename + '.nc'
            if not os.path.isfile(out_path):
                catUrl = ('https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/'
                          '2020/2020{:0>2}{:0>2}/catalog.xml'.format(month, day))
                datasetName = filename + '.grib2'
                time.sleep(0.01)  # brief pause between requests
                # Locate the dataset in the THREDDS catalog and build a subset query
                catalog = TDSCatalog(catUrl)
                ds = catalog.datasets[datasetName]
                ncss = ds.subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                # Request the subset and save it locally as netCDF
                nc = ncss.get_data(query)
                data = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
                data.to_netcdf(out_path)

The script above does what I need; however, after it has been downloading files for a while it dies with the following error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

What could be happening?


Solution

  • Unfortunately, there's no way with THREDDS and NCSS to request data based on the model reference time, so there's no way to avoid looping over the individual files (a sketch of one way to build that loop is at the end of this answer).

    I will say that this is a TON of data, so at the very least make sure you're being kind to the publicly available server. Downloading close to three years' worth of output is something you should do slowly, over time, and with care, so that you don't impact others' use of this shared, free resource. A wait time of 1/100th of a second is, in my opinion, not doing that. I would wait a minimum of 30 seconds between requests if you're going to request this much data (the pacing/retry sketch at the end of this answer shows one way to work that in).

    I'll also add that you can simplify saving the results of the request to a netCDF file: there's no need to go through xarray, since what the server returns is already a netCDF file:

    ...
    query.accept('netcdf4')
    out_path = ('/home/william/GFS_Siphon/2020{:0>2}'.format(month)
                + '/gfs.0p25.2020{:0>2}{:0>2}00.f{:0>3}.nc'.format(month, day, i))
    # The response is already netCDF, so the bytes can be written straight to disk
    with open(out_path, 'wb') as outfile:
        outfile.write(ncss.get_data_raw(query))
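    Since the loop over the files is unavoidable, it may also be worth generating only valid calendar dates instead of running day 1-31 for every month, which will also request dates that don't exist (e.g. 30 February). The following is only a sketch: the 2020-2022 range and the 6-hourly step are taken from the question, and f006 through f168 is an assumption based on the code shown there.

    from datetime import datetime, timedelta

    start = datetime(2020, 1, 1)
    end = datetime(2022, 12, 31)

    date = start
    while date <= end:
        for fhour in range(6, 169, 6):  # f006 ... f168, matching the question's 6-hour step
            catalog_url = ('https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/'
                           f'{date:%Y}/{date:%Y%m%d}/catalog.xml')
            dataset_name = f'gfs.0p25.{date:%Y%m%d}00.f{fhour:03d}.grib2'
            # ... build and send the NCSS request for (catalog_url, dataset_name) here ...
        date += timedelta(days=1)  # stepping a datetime only ever produces real dates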
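    As for the pacing and the ConnectionError: one way to combine a longer wait with a simple retry and the raw netCDF write is a small helper along these lines. This is a sketch rather than a tested solution; the function name, the 30-second pause, the retry count, and catching requests.exceptions.ConnectionError (the exception type shown in the traceback) are choices made here for illustration.

    import time

    import requests
    from siphon.catalog import TDSCatalog

    def download_subset(catalog_url, dataset_name, out_path, retries=3, pause=30):
        """Request one NCSS subset and write the returned netCDF bytes to out_path."""
        for _ in range(retries):
            try:
                catalog = TDSCatalog(catalog_url)
                ncss = catalog.datasets[dataset_name].subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                # The server already returns netCDF, so write the bytes directly
                with open(out_path, 'wb') as outfile:
                    outfile.write(ncss.get_data_raw(query))
                return True
            except requests.exceptions.ConnectionError:
                pass  # the server dropped the connection; wait and try again
            time.sleep(pause)
        return False

    Even when a request succeeds, keep a pause (for example time.sleep(30)) between files in the outer loop so the load on the server stays low.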