I have a table that has to be populated every hour with data pulled out of Greenplum. This table is stored on the Greenplum server.
So, what I want to know is which method (a Python script, Windows scheduler, or anything else) would be suitable for scheduling my query (written in PostgreSQL) to run every hour, given that the data can be as large as 60 GB or more.
Could someone share some sample code for this?
You will want to do a parallel COPY of ranges of the data out of Greenplum and into PostgreSQL. Make sure PostgreSQL is set up for fast data loading. If possible, use an UNLOGGED table; otherwise use wal_level = 'minimal' at least.
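For example, a minimal psycopg2 sketch of recreating an UNLOGGED staging table before each hourly load might look like the following; the host, database, and table names (dataload_temp, mytable) are placeholders, not anything prescribed:

import psycopg2

# Recreate an UNLOGGED staging table before each hourly load.
conn = psycopg2.connect(host="postgres-box", dbname="target_db")
cur = conn.cursor()
cur.execute("DROP TABLE IF EXISTS dataload_temp")
# UNLOGGED skips WAL writes, which makes bulk loading much faster; the table
# is emptied after a crash, which is acceptable for data reloaded every hour.
cur.execute("CREATE UNLOGGED TABLE dataload_temp (LIKE mytable INCLUDING DEFAULTS)")
conn.commit()
cur.close()
conn.close()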
How many parallel workers depends on the PostgreSQL server's I/O subsystem more than anything. Test and see.
I would recommend using Python with psycopg2 and the copy_expert cursor function. See the docs. Use multiprocessing with a pipe to share a file-like object between a reader worker and a writer worker, with the reader connected to Greenplum and the writer to PostgreSQL.
So effectively each worker does something a bit like the following shell pseudo-code:
psql -h greenplum-box \
  -c "COPY (SELECT * FROM mytable WHERE id BETWEEN 1 AND 10000) TO STDOUT" \
| \
psql -h postgres-box \
  -c "COPY mytable FROM STDIN";
(but you connect the two up using psycopg2, copy_expert, multiprocessing, and a pipe).
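Here is a rough Python sketch of one such worker, not production code: it assumes mytable has an integer id column, that the PostgreSQL side loads into a staging table called dataload_temp, and a Unix box where multiprocessing's fork start method is available. Host names, database names, and the id ranges are placeholders.

import multiprocessing
import os

import psycopg2


def read_from_greenplum(read_fd, write_fd, lo, hi):
    # Child process: COPY one id range out of Greenplum into the pipe's write end.
    os.close(read_fd)  # the read end belongs to the parent
    with os.fdopen(write_fd, "w", encoding="utf-8") as pipe_out:
        conn = psycopg2.connect(host="greenplum-box", dbname="source_db")
        cur = conn.cursor()
        # lo/hi come from our own range list below, so plain formatting is fine here
        cur.copy_expert(
            "COPY (SELECT * FROM mytable WHERE id BETWEEN %d AND %d) TO STDOUT" % (lo, hi),
            pipe_out,
        )
        conn.close()


def transfer_range(lo, hi):
    # Fork a reader for one id range and feed its output into PostgreSQL here.
    read_fd, write_fd = os.pipe()
    reader = multiprocessing.Process(
        target=read_from_greenplum, args=(read_fd, write_fd, lo, hi)
    )
    reader.start()
    os.close(write_fd)  # keep only the read end, so EOF arrives when the reader exits
    with os.fdopen(read_fd, "r", encoding="utf-8") as pipe_in:
        conn = psycopg2.connect(host="postgres-box", dbname="target_db")
        cur = conn.cursor()
        cur.copy_expert("COPY dataload_temp FROM STDIN", pipe_in)
        conn.commit()
        conn.close()
    reader.join()


if __name__ == "__main__":
    multiprocessing.set_start_method("fork")  # the pipe fds rely on fork semantics (Unix only)
    # One worker per id range; tune how many run in parallel to the PostgreSQL
    # server's I/O capacity, as noted above.
    ranges = [(1, 10000), (10001, 20000), (20001, 30000)]
    workers = [multiprocessing.Process(target=transfer_range, args=r) for r in ranges]
    for w in workers:
        w.start()
    for w in workers:
        w.join()

Each transfer_range call forks a reader for one id range and keeps the writer in the parent; the parent closes its copy of the pipe's write end so that COPY FROM STDIN sees end-of-file as soon as the reader finishes.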
Do all the usual fast-loading work, like creating indexes only after the data is in. See how to speed up insertion performance in PostgreSQL.
If you have the disk space, create a table like dataload_temp, populate it, then in one transaction drop the old one and rename the new one to the old one's name. That way there's minimal disruption.
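A hedged sketch of that swap with psycopg2, assuming the load went into dataload_temp and the live table is called mytable (both names illustrative); the indexes are built on the staging table just before the swap, per the note above:

import psycopg2

conn = psycopg2.connect(host="postgres-box", dbname="target_db")
cur = conn.cursor()
# Index after loading, not before: index maintenance during the load is what hurts.
cur.execute("CREATE INDEX ON dataload_temp (id)")
# If the staging table was created UNLOGGED, ALTER TABLE dataload_temp SET LOGGED
# (PostgreSQL 9.5+) before the swap makes the final table crash-safe.
# psycopg2 runs these statements in one transaction until commit(), so readers
# never see a moment where the table is missing.
cur.execute("DROP TABLE IF EXISTS mytable")
cur.execute("ALTER TABLE dataload_temp RENAME TO mytable")
conn.commit()
cur.close()
conn.close()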
Alternatively, take a look at pg_bulkload for off-line (but non-streaming) bulk data loading.