I have been able to extract close to 3.5 million rows from a Postgres table using Python and write them to a file. However, the process is extremely slow, and I'm sure it's not the most efficient. Here is my code:
import psycopg2, time, csv
import psycopg2.extras

conn_string = "host='compute-1.amazonaws.com' dbname='re' user='data' password='reck' port=5433"
conn = psycopg2.connect(conn_string)
# DictCursor so rows can be addressed by column name in the write loop below
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)

query = '''select data from table;'''
cursor.execute(query)

def get_data():
    while True:
        recs = cursor.fetchmany(10000)
        if not recs:
            break
        for columns in recs:
            # do transformation of data here
            yield columns

solr_input = get_data()

count = 0
with open('prc_ind.csv', 'a') as fh:
    for i in solr_input:
        count += 1
        if count % 1000 == 0:
            print(count)
        a, b, c, d = i['Skills'], i['Id'], i['History'], i['Industry']
        fh.write("{0}|{1}|{2}|{3}\n".format(a, b, c, d))
The table has about 8 million rows. Is there a better, faster, and less memory-intensive way to accomplish this?
I can see four fields being written, so I'll assume you are selecting only those.
Even so, you are still pulling 8 million rows x 4 fields x n bytes of data from what appears to be a remote server, so yes, it will take some time.
Rather than reinventing the wheel, why not use the PostgreSQL client directly?
psql -d dbname -t -A -F"," -c "select * from users" > output.csv
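The -A and -F switches give unaligned output with your chosen field separator, so -F"|" would reproduce the pipe-delimited layout you are writing now.

If you would rather stay in Python, psycopg2 also exposes COPY through cursor.copy_expert, which streams rows from the server straight into a file handle instead of building them up in Python. Here is a minimal sketch; the table name (candidates) and the quoted column names are assumptions taken from your write loop, so adjust them to your schema, and note that any per-row transformation would have to move into the SELECT or a later step:

import psycopg2

conn_string = "host='compute-1.amazonaws.com' dbname='re' user='data' password='reck' port=5433"

# Assumed table and column names -- replace with your real schema.
copy_sql = """
    COPY (SELECT "Skills", "Id", "History", "Industry" FROM candidates)
    TO STDOUT WITH (FORMAT csv, DELIMITER '|')
"""

with psycopg2.connect(conn_string) as conn:
    with conn.cursor() as cur, open('prc_ind.csv', 'w') as fh:
        # The server streams rows directly into the file,
        # so only a small buffer is held in memory at any time.
        cur.copy_expert(copy_sql, fh)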