Tags: postgresql, scheduling, greenplum

Scheduling a PostgreSQL query operating on a Greenplum DB


I have a table that has to be populated every hour with data pulled out of Greenplum. This table is stored on the Greenplum server.

So what I want to know is: what method (a Python script, Windows scheduler, or anything else) would be suitable for scheduling this query (written in PostgreSQL) to run every hour, given that the data can be as large as 60 GB or more?

Can someone attach sample code for this?


Solution

  • You will want to do a parallel COPY of ranges of the data out of Greenplum and into PostgreSQL. Make sure PostgreSQL is set up for fast data loading: if possible, use an UNLOGGED table; otherwise, at least set wal_level = 'minimal'.
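
    For example, the staging table discussed further down can be created UNLOGGED. Here is a minimal sketch using psycopg2; the DSN and table names are placeholders, and it assumes the target table already exists so its definition can be copied. Note that wal_level is a postgresql.conf setting that needs a server restart to change, so the UNLOGGED route is usually the easier one.

    import psycopg2

    # Placeholder DSN and table names -- adapt to your environment.
    conn = psycopg2.connect("host=postgres-box dbname=target")
    with conn, conn.cursor() as cur:
        # An UNLOGGED table is not WAL-logged, which is what makes the bulk
        # load cheap; its contents are lost on a crash, which is fine for a
        # staging table that gets rebuilt every hour anyway.
        cur.execute("""
            CREATE UNLOGGED TABLE IF NOT EXISTS dataload_temp
                (LIKE mytable INCLUDING DEFAULTS)
        """)
    conn.close()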

    How many parallel workers depends on the PostgreSQL server's I/O subsystem more than anything. Test and see.

    I would recommend using Python with psycopg2 and the copy_expert cursor function; see the psycopg2 documentation. Use multiprocessing with a pipe to share a file-like object between a reader worker and a writer worker, with the reader connected to Greenplum and the writer to PostgreSQL.

    So effectively each worker does something a bit like the following shell pseudo-code:

    psql -h greenplum-box \
         -c "COPY (SELECT * FROM mytable WHERE id BETWEEN 1 AND 10000) TO STDOUT" \
      | \
    psql -h postgres-box \
         -c "COPY mytable FROM STDIN";
    

    (but you connect the two up using psycopg2, copy_expert, multiprocessing, and a pipe, roughly as sketched below).
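
    Here is a rough sketch of that wiring in Python (Unix only: it relies on the fork start method so the pipe file descriptors are inherited by the worker processes). The DSNs, table names and id ranges are placeholders you would adapt; it loads into the dataload_temp staging table discussed below.

    import os
    import multiprocessing

    import psycopg2

    GREENPLUM_DSN = "host=greenplum-box dbname=source"   # placeholder
    POSTGRES_DSN = "host=postgres-box dbname=target"     # placeholder

    def export_range(read_fd, write_fd, lo, hi):
        """Reader worker: COPY one id range out of Greenplum into the pipe."""
        os.close(read_fd)                   # the read end belongs to the writer
        conn = psycopg2.connect(GREENPLUM_DSN)
        with os.fdopen(write_fd, "wb") as pipe_out, conn.cursor() as cur:
            sql = ("COPY (SELECT * FROM mytable "
                   "WHERE id BETWEEN {} AND {}) TO STDOUT").format(lo, hi)
            cur.copy_expert(sql, pipe_out)
        conn.close()

    def import_range(read_fd, write_fd):
        """Writer worker: COPY whatever arrives on the pipe into PostgreSQL."""
        os.close(write_fd)                  # must close it, or we never see EOF
        conn = psycopg2.connect(POSTGRES_DSN)
        with os.fdopen(read_fd, "rb") as pipe_in, conn.cursor() as cur:
            # Load into the staging table (or the target table directly).
            cur.copy_expert("COPY dataload_temp FROM STDIN", pipe_in)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        # One reader/writer pair per id range; how many ranges you run at once
        # should be tuned against the PostgreSQL server's I/O, as noted above.
        ranges = [(1, 10000), (10001, 20000), (20001, 30000)]   # placeholders
        procs = []
        for lo, hi in ranges:
            read_fd, write_fd = os.pipe()
            reader = multiprocessing.Process(target=export_range,
                                             args=(read_fd, write_fd, lo, hi))
            writer = multiprocessing.Process(target=import_range,
                                             args=(read_fd, write_fd))
            reader.start()
            writer.start()
            os.close(read_fd)               # the parent keeps no pipe ends open
            os.close(write_fd)
            procs.extend([reader, writer])
        for p in procs:
            p.join()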

    Do all the usual fast-loading work, such as creating indexes afterwards; the sketch below shows that together with the table swap. See "How to speed up insertion performance in PostgreSQL".

    If you have the disk space, create a table like dataload_temp, populate it, then in one transaction drop the old one and rename the new one to the old one's name. That way there's minimal disruption.
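
    A sketch of that last step, again with psycopg2 and placeholder names, assuming dataload_temp has already been populated by the COPY workers:

    import psycopg2

    conn = psycopg2.connect("host=postgres-box dbname=target")   # placeholder DSN
    with conn, conn.cursor() as cur:    # everything below runs in one transaction
        # Build indexes only now, after the bulk load has finished.
        cur.execute("CREATE INDEX ON dataload_temp (id)")
        # If the staging table was created UNLOGGED and the data should be
        # crash-safe after the swap, PostgreSQL 9.5+ can convert it:
        # cur.execute("ALTER TABLE dataload_temp SET LOGGED")
        # Swap the tables atomically: readers see either the old table or the
        # new one, never a half-loaded state.
        cur.execute("DROP TABLE IF EXISTS mytable")
        cur.execute("ALTER TABLE dataload_temp RENAME TO mytable")
    conn.close()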

    Alternatively, take a look at pg_bulkload for off-line (but non-streaming) bulk data loading.