
How to read large NetCDF data sets without using a for loop - Python


Good morning. I have a problem reading a large netCDF file in Python. The file contains meteorological data, and I have to walk through it, assemble the values into rows, and then insert those rows into a database. Walking through the file and assembling the rows takes far too long. I know there must be a more efficient way to do the same thing. Currently I access the data with nested for loops; the code is below:

 import netCDF4 as nc

 content = nc.Dataset(pathFile + file)
 XLONG, XLAT = content.variables["XLONG"], content.variables["XLAT"]
 Times = content.variables["Times"]    # Hours (byte-string format, b'...')
 RAINC = content.variables["RAINC"]    # Rain
 Q2 = content.variables["Q2"]          # Specific humidity
 T2 = content.variables["T2"]          # Temperature
 U10 = content.variables["U10"]        # Zonal wind
 V10 = content.variables["V10"]        # Meridional wind
 SWDOWN = content.variables["SWDOWN"]  # Incident radiation
 PSFC = content.variables["PSFC"]      # Surface pressure
 SST = content.variables["SST"]        # Sea surface temperature
 CLDFRA = content.variables["CLDFRA"]  # Cloud fraction

 for c2 in range(len(XLONG[0])):
     for c3 in range(len(XLONG[0][c2])):
         position += 1
         for hour in range(len(Times)):
             dateH = (getDatetimeInit(dateFormatFile.hour) if hour == 0
                      else getDatetimeForHour(hour, dateFormatFile.hour))
             hourUTC = getHourUTC(hour)

             RAINH = str(RAINC[hour][0][c2][c3])
             Q2H = str(Q2[hour][0][c2][c3])
             T2H = str(convertKelvinToCelsius(T2[hour][0][c2][c3]))
             U10H = str(U10[hour][0][c2][c3])
             V10H = str(V10[hour][0][c2][c3])
             SWDOWNH = str(SWDOWN[hour][0][c2][c3])
             PSFCH = str(PSFC[hour][0][c2][c3])
             SSTH = str(SST[hour][0][c2][c3])
             CLDFRAH = str(CLDFRA[hour][0][c2][c3])

             rowData = [idRun, functions.IDMODEL, idTime, position,
                        dateH.year, dateH.month, dateH.day, dateH.hour,
                        RAINH, Q2H, T2H, U10H, V10H, SWDOWNH, PSFCH, SSTH, CLDFRAH]
             dataProcess.append(rowData)

Solution

  • I would use NumPy. Let us assume you have a netCDF file with two variables, "t2" and "slp". Then you could use the following code to vectorize your data:

    #!/usr/bin/env python
    # ---------------------
    import numpy as np
    from netCDF4 import Dataset
    # ---------------------
    filein = 'test.nc'
    ncin = Dataset(filein)
    tair = ncin.variables['t2'][:]
    slp = ncin.variables['slp'][:]
    ncin.close()
    # -------------------------
    # flatten each variable into a single column
    tairseries = np.reshape(tair, (np.size(tair), 1))
    slpseries = np.reshape(slp, (np.size(slp), 1))
    # --------------------------
    ## if you want characters:
    #tairseries = np.array([str(val) for val in tairseries])
    #slpseries = np.array([str(val) for val in slpseries])
    # --------------------------
    # one row per data point, one column per variable
    rowdata = np.concatenate((tairseries, slpseries), axis=1)
    # if you want characters, do this in the end:
    row_asstrings = [[str(vv) for vv in val] for val in rowdata]
    # ---------------------------
    

    Nevertheless, I have a feeling that using strings is not a very good idea. In my example, converting the numerical arrays to strings took quite a long time, which is why I do it only at the end, after the concatenation.
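
As a rough illustration of the point about strings, the sketch below uses small synthetic arrays in place of the variables read from the file (the array contents and the names `rows_numeric` and `rows_text` are assumptions made for the example). It shows two options: keeping the values numeric and handing them over as plain Python floats, or converting the whole array to strings in a single NumPy call rather than element by element. Whether either is fast enough depends on your data and database driver.

```python
import numpy as np

# Synthetic stand-ins for two variables with shape (time, y, x).
tair = np.arange(8, dtype=np.float64).reshape(2, 2, 2)
slp = tair + 1000.0

# One row per data point, one column per variable.
rowdata = np.column_stack((tair.ravel(), slp.ravel()))

# Option 1: keep the numbers; tolist() yields nested lists of Python floats.
rows_numeric = rowdata.tolist()

# Option 2: convert the whole array to strings in one call.
rows_text = rowdata.astype(str)
```

If the database accepts numeric parameters, option 1 avoids the string conversion altogether.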

    If you also want time and location information, you can do it like this:

    #!/usr/bin/env python
    # ---------------------
    import numpy as np
    from netCDF4 import Dataset
    # ---------------------
    filein = 'test.nc'
    ncin = Dataset(filein)
    xin = ncin.variables['lon'][:]
    yin = ncin.variables['lat'][:]
    timein = ncin.variables['time'][:]
    tair = ncin.variables['t2'][:]
    slp = ncin.variables['slp'][:]
    ncin.close()
    # -------------------------
    # flatten each variable into a single column
    tairseries = np.reshape(tair, (np.size(tair), 1))
    slpseries = np.reshape(slp, (np.size(slp), 1))
    # --------------------------
    ## if you want characters:
    #tairseries = np.array([str(val) for val in tairseries])
    #slpseries = np.array([str(val) for val in slpseries])
    # --------------------------
    rowdata = np.concatenate((tairseries, slpseries), axis=1)
    # if you want characters, do this in the end:
    #row_asstrings = [[str(vv) for vv in val] for val in rowdata]
    # ---------------------------
    # =========================================================
    nx = np.size(xin)
    ny = np.size(yin)
    ntime = np.size(timein)
    # 2D coordinate grids, repeated for every time step
    xm, ym = np.meshgrid(xin, yin)
    xmt = np.tile(xm, (ntime, 1, 1))
    ymt = np.tile(ym, (ntime, 1, 1))
    # time value repeated for every grid point
    timem = np.tile(timein[:, np.newaxis, np.newaxis], (1, ny, nx))
    # to make sure that the array sizes match, I am using the size of one of the variables
    xvec = np.reshape(xmt, (np.size(tair), 1))
    yvec = np.reshape(ymt, (np.size(tair), 1))
    timevec = np.reshape(timem, (np.size(tair), 1))
    rowdata = np.concatenate((xvec, yvec, timevec, tairseries, slpseries), axis=1)
    

    In any case, with variables of shape (744, 150, 150), it took less than 2 seconds to vectorize 2 variables.
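
The same idea can be carried over to the loop in the question. The following is only a sketch under stated assumptions: the arrays are synthetic stand-ins laid out as (time, south_north, west_east), the Kelvin to Celsius conversion is assumed to be a subtraction of 273.15 (the question's `convertKelvinToCelsius` is not shown), and `position` is taken to be a 1-based running number of the grid cell. Note that the rows come out ordered by hour first and grid cell second, which is the reverse of the nesting in the question.

```python
import numpy as np

# Synthetic stand-ins for two of the question's variables, assumed to be
# laid out as (time, south_north, west_east). Real data would come from
# content.variables["T2"][:] and content.variables["RAINC"][:].
ntime, ny, nx = 3, 2, 4
T2 = 273.15 + np.arange(ntime * ny * nx, dtype=np.float64).reshape(ntime, ny, nx)
RAINC = np.zeros((ntime, ny, nx))

# Kelvin to Celsius for every value at once (assumed conversion).
T2_celsius = T2 - 273.15

# Index arrays with the same shape as the data: hour, row and column
# of each element.
hour_idx, y_idx, x_idx = np.indices((ntime, ny, nx))

# Hypothetical 1-based running number of the grid cell, standing in for
# the question's "position" counter (its start value is not shown there).
position = y_idx * nx + x_idx + 1

# One row per (hour, cell). ravel() flattens every array in the same
# order, so the columns stay aligned.
rows = np.column_stack(
    (position.ravel(), hour_idx.ravel(), RAINC.ravel(), T2_celsius.ravel())
)
```

The date fields could be built the same way, by computing one datetime per hour index and repeating it across the grid, instead of calling a helper for every element.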