Tags: python, postgresql, postgis, psycopg2

Speeding up insertion of point data from netcdf


I've got a NetCDF file of weather data (one of thousands that require PostgreSQL ingestion). I'm currently able to insert each band into a PostGIS-enabled table at a rate of about 20-23 seconds per band. (That's for monthly data; there is also daily data that I have yet to test.)

I've heard of different ways of speeding this up (COPY FROM, dropping the gid, using SSDs, etc.), but I'm new to Python and have no idea how to get the NetCDF data into a form I could use with COPY FROM, or what the best route might be.

If anyone has any other ideas on how to speed this up, please share!

Here is the ingestion script:

import netCDF4, psycopg2, time

# Establish connection
db1 = psycopg2.connect("host=localhost dbname=postgis_test user=********** password=********")
cur = db1.cursor()

# Create Table in postgis
print(str(time.ctime()) + " CREATING TABLE")
try:
    cur.execute("DROP TABLE IF EXISTS table_name;")
    db1.commit()
    cur.execute(
        "CREATE TABLE table_name (gid serial PRIMARY KEY not null, thedate DATE, thepoint geometry, lon decimal, lat decimal, thevalue decimal);")
    db1.commit()
    print("TABLE CREATED")
except psycopg2.DatabaseError as e:
    print(e)
    print("TABLE CREATION FAILED")

rawvalue_nc_file = 'netcdf_file.nc'
nc = netCDF4.Dataset(rawvalue_nc_file, mode='r')
nc.variables.keys()

lat = nc.variables['lat'][:]
lon = nc.variables['lon'][:]
time_var = nc.variables['time']
dtime = netCDF4.num2date(time_var[:], time_var.units)
newtime = [fdate.strftime('%Y-%m-%d') for fdate in dtime]
rawvalue = nc.variables['tx_max'][:]

# Map array indices to coordinate values for quick lookup
lathash = dict(enumerate(lat.tolist()))
lonhash = dict(enumerate(lon.tolist()))

for timestep in range(dtime.size):
    print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
    for _lon in range(lon.size):
        for _lat in range(lat.size):
            latitude = round(lathash[_lat], 6)
            longitude = round(lonhash[_lon], 6)
            thedate = newtime[timestep]
            # Convert Kelvin to Celsius; very negative values indicate missing data
            thevalue = round(float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3)
            if (thevalue > -100):
                cur.execute("INSERT INTO table_name (thedate, thepoint, thevalue) VALUES (%s, ST_MakePoint(%s,%s,0), %s)",(thedate, longitude, latitude, thevalue))
    db1.commit()
cur.close()
db1.close()

print(" Done!")

Solution

  • If you're certain most of the time is spent in PostgreSQL, and not in any other code of your own, you may want to look at psycopg2's fast execution helpers, namely psycopg2.extras.execute_values() in your case.

    Also, you may want to make sure you're running inside a transaction, so the database doesn't fall back to autocommit mode. (From the PostgreSQL docs: "If you do not issue a BEGIN command, then each individual statement has an implicit BEGIN and (if successful) COMMIT wrapped around it.")

    Something like this could do the trick -- not tested though.

    import psycopg2.extras  # the helper lives in psycopg2.extras, not on the cursor

    for timestep in range(dtime.size):
        print(str(time.ctime()) + " " + str(timestep + 1) + "/180")
        values = []
    
        cur.execute("BEGIN")
    
        for _lon in range(lon.size):
            for _lat in range(lat.size):
                latitude = round(lathash[_lat], 6)
                longitude = round(lonhash[_lon], 6)
                thedate = newtime[timestep]
                thevalue = round(
                    float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3
                )
                if thevalue > -100:
                    values.append((thedate, longitude, latitude, thevalue))
    
        psycopg2.extras.execute_values(
            cur,
            "INSERT INTO table_name (thedate, thepoint, thevalue) VALUES %s",
            values,
            template="(%s, ST_MakePoint(%s,%s,0), %s)"
        )
        db1.commit()
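
  • The question also mentions COPY FROM. A rough, untested sketch of that route could look like the following: write each timestep's rows into an in-memory CSV buffer, load it with cur.copy_expert(), and fill the geometry column once at the end with a single UPDATE. It reuses the connection, table_name, and the NetCDF variables (lathash, lonhash, newtime, rawvalue, dtime) from the script above, and it relies on the lon and lat columns already being defined in the CREATE TABLE.

    import io

    for timestep in range(dtime.size):
        buf = io.StringIO()
        for _lon in range(lon.size):
            for _lat in range(lat.size):
                thevalue = round(float(rawvalue.data[timestep, _lat, _lon] - 273.15), 3)
                if thevalue > -100:
                    # one CSV row per point: date, lon, lat, value
                    buf.write("%s,%s,%s,%s\n" % (newtime[timestep],
                                                 round(lonhash[_lon], 6),
                                                 round(lathash[_lat], 6),
                                                 thevalue))
        buf.seek(0)
        # bulk-load the plain columns; the geometry is filled in afterwards
        cur.copy_expert(
            "COPY table_name (thedate, lon, lat, thevalue) FROM STDIN WITH (FORMAT csv)",
            buf
        )
        db1.commit()

    # build the geometry column in one statement once all rows are loaded
    cur.execute("UPDATE table_name SET thepoint = ST_MakePoint(lon, lat, 0) WHERE thepoint IS NULL;")
    db1.commit()

    COPY generally outperforms even batched INSERTs, and the single UPDATE avoids calling ST_MakePoint once per row from Python, but it's worth measuring both approaches against execute_values() on your own data.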