Tags: for-loop, netcdf, python-xarray, gfs, python-siphon

How to use a for loop to download GFS URL data using Siphon


I'm trying to loop over and download a subset of GFS data using the Siphon library. With the code laid out below, I can download one file at a time without problems. How can I download the period January 2020 to December 2022, from forecast hour 003 through 168, without having to request a single file at a time?

from siphon.catalog import TDSCatalog
from siphon.ncss import NCSS
import numpy as np
import ipywidgets as widgets
from datetime import datetime, timedelta
import xarray as xr
from netCDF4 import num2date
import os
import time

# Download GFS subset - Radiation (6 Hour Average) and PBLH

for i in range(6, 168, 6):                        # forecast hours f006, f012, ...
    for day in range(1, 32):
        for month in range(1, 13):
            dir_out = '/home/william/GFS_Siphon/2020{:0>2}'.format(month)
            if not os.path.exists(dir_out):
                os.makedirs(dir_out)
            filename = 'gfs.0p25.2020{:0>2}{:0>2}00.f{:0>3}'.format(month, day, i)
            out_path = dir_out + '/' + filename + '.nc'
            if not os.path.isfile(out_path):
                catUrl = ('https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/'
                          '2020/2020{:0>2}{:0>2}/catalog.xml'.format(month, day))
                datasetName = filename + '.grib2'
                time.sleep(0.01)  # brief pause between requests
                # Locate the dataset in the THREDDS catalog and build a subset query
                catalog = TDSCatalog(catUrl)
                ds = catalog.datasets[datasetName]
                ncss = ds.subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                # Request the subset and save it locally as netCDF
                nc = ncss.get_data(query)
                data = xr.open_dataset(xr.backends.NetCDF4DataStore(nc))
                data.to_netcdf(out_path)

The script above does what I need; however, after it has been downloading files for a while it dies with the following error: ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))

What could be happening?


Solution

  • Unfortunately, there's no way with THREDDS and NCSS to request data based on the model reference time, so there's no way to avoid looping over the individual files (a sketch of one way to build that loop is at the end of this answer).

    I will say that this is a TON of data, so at the very least make sure you're being kind to the publicly available server. Downloading close to three years' worth of output is something you should do slowly, over time, and with care, so that you don't impact others' use of this shared, free resource. A wait time of 1/100th of a second is, in my opinion, not doing that. I would wait a minimum of 30 seconds between requests if you're going to request this much data (the pacing/retry sketch at the end of this answer shows one way to work that in).

    I'll also add that you can simplify saving the results of the request to a netCDF file: there's no need to go through xarray, since what the server returns is already a netCDF file:

    ...
    query.accept('netcdf4')
    out_path = ('/home/william/GFS_Siphon/2020{:0>2}'.format(month)
                + '/gfs.0p25.2020{:0>2}{:0>2}00.f{:0>3}.nc'.format(month, day, i))
    # The response is already netCDF, so the bytes can be written straight to disk
    with open(out_path, 'wb') as outfile:
        outfile.write(ncss.get_data_raw(query))
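    Since the loop over the files is unavoidable, it may also be worth generating only valid calendar dates instead of running day 1-31 for every month, which will also request dates that don't exist (e.g. 30 February). The following is only a sketch: the 2020-2022 range and the 6-hourly step are taken from the question, and f006 through f168 is an assumption based on the code shown there.

    from datetime import datetime, timedelta

    start = datetime(2020, 1, 1)
    end = datetime(2022, 12, 31)

    date = start
    while date <= end:
        for fhour in range(6, 169, 6):  # f006 ... f168, matching the question's 6-hour step
            catalog_url = ('https://rda.ucar.edu/thredds/catalog/files/g/ds084.1/'
                           f'{date:%Y}/{date:%Y%m%d}/catalog.xml')
            dataset_name = f'gfs.0p25.{date:%Y%m%d}00.f{fhour:03d}.grib2'
            # ... build and send the NCSS request for (catalog_url, dataset_name) here ...
        date += timedelta(days=1)  # stepping a datetime only ever produces real dates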
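    As for the pacing and the ConnectionError: one way to combine a longer wait with a simple retry and the raw netCDF write is a small helper along these lines. This is a sketch rather than a tested solution; the function name, the 30-second pause, the retry count, and catching requests.exceptions.ConnectionError (the exception type shown in the traceback) are choices made here for illustration.

    import time

    import requests
    from siphon.catalog import TDSCatalog

    def download_subset(catalog_url, dataset_name, out_path, retries=3, pause=30):
        """Request one NCSS subset and write the returned netCDF bytes to out_path."""
        for _ in range(retries):
            try:
                catalog = TDSCatalog(catalog_url)
                ncss = catalog.datasets[dataset_name].subset()
                query = ncss.query()
                query.lonlat_box(east=-30, west=-50, south=-20, north=0)
                query.variables(
                    'Downward_Short-Wave_Radiation_Flux_surface_6_Hour_Average',
                    'Planetary_Boundary_Layer_Height_surface').add_lonlat()
                query.accept('netcdf4')
                # The server already returns netCDF, so write the bytes directly
                with open(out_path, 'wb') as outfile:
                    outfile.write(ncss.get_data_raw(query))
                return True
            except requests.exceptions.ConnectionError:
                pass  # the server dropped the connection; wait and try again
            time.sleep(pause)
        return False

    Even when a request succeeds, keep a pause (for example time.sleep(30)) between files in the outer loop so the load on the server stays low.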