I have a huge table (~0.8 billion rows) and I need to fetch data based on some segment condition.
Data:
d_id  month_id  sec  average  Class
89    201701    S    5.98     A
73    201703    N    7.63     B
31    201708    F    6.38     P
11    201709    K    6.38     P
I have two lists:
monthList = [201701,201702,201703]
so the SQL query is:
sql_query = str("""select * from dbo.table_name where month_id IN monthList;""")
Now I want to hold this data in server side cursor and from that fetch a subset based on a classList
curs = cnxn.cursor('Class')
classList = ['A','B','P']
while True:
    records = curs.fetchmany(int(1e3))
    if not records:
        break
    for record in records:
        # here I want to use the classList to subset the data, something like
        df = pd.DataFrame()
        df.append(curs.fetchmany([cursor['Class'] == classList]))
# and once all the records of each Class have been read, create a csv
df.to_csv("ClassList.csv")
So for the data given above, there will be 3 CSVs generated, e.g. ClassP.csv:
d_id  month_id  sec  average  Class
31    201708    F    6.38     P
11    201709    K    6.38     P
All the data is in PostgreSQL, which I am querying with psycopg2.
Can someone please help me with: 1. Is this even possible to do with server-side cursors? 2. I basically need to create a combined CSV for each Class from all the data, also filtered on the month_id list.
That's not quite how server-side cursors work - they make the server hold state while the client traverses the result set, fetching in batches and possibly reversing the traversal. The benefit is that the client can manage memory sensibly: by default, a client-side cursor tries to fetch the entire result set before allowing your code to iterate, and for 0.8 billion rows that will cause problems.
But the key thing to remember is that the data returned by the cursor is determined by the query - you can inspect each row and decide what to do with it, but you are still operating row by row; you cannot change the results the server returns after the fact. And your DBA may come after you with violent intent if you abuse the server: holding a server-side cursor over 0.8 billion rows while making multiple traversals puts a lot of pressure on the database and slows things down for other users.
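To make the row-by-row pattern concrete, here is a minimal sketch of it, assuming an already-open psycopg2 connection `cnxn`; the table and column layout are taken from the question, and the cursor name `month_scan` is made up for illustration:

```python
def route_rows_by_class(rows, class_list, class_index=4):
    """Pure helper: split fetched rows into per-class buckets.

    `class_index` is the position of the Class column in each row tuple
    (4 for the five-column table in the question).
    """
    buckets = {c: [] for c in class_list}
    for row in rows:
        cls = row[class_index]
        if cls in buckets:
            buckets[cls].append(row)
    return buckets

def stream_and_split(cnxn, class_list, batch_size=1000):
    """Iterate a named (server-side) cursor in batches, bucketing by Class."""
    out = {c: [] for c in class_list}
    curs = cnxn.cursor(name='month_scan')  # naming the cursor makes it server-side
    curs.execute(
        "select * from dbo.table_name "
        "where month_id in (201701, 201702, 201703)"
    )
    while True:
        records = curs.fetchmany(batch_size)
        if not records:
            break
        for cls, rows in route_rows_by_class(records, class_list).items():
            # For truly large data, write each batch out to its CSV here
            # instead of accumulating everything in memory.
            out[cls].extend(rows)
    curs.close()
    return out
```

Note that the filtering happens on the client after each batch is fetched - the server still streams every matching row, which is exactly why this approach struggles at this scale.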
And the same is true of local memory via Pandas - based on your example, you aren't doing anything with the DataFrame other than using it to generate the CSV.
What to do?
If you just need to write a large, combined CSV, psycopg2's native copy_expert feature is the way to go: it streams the query result straight into the file on the client side. (Note that psycopg2 only runs COPY commands on a regular client-side cursor, not a named one - the streaming is done by the COPY protocol itself, so you don't need a server-side cursor here.)
I frequently use this approach to create CSVs from large data sets - while keeping both database and client memory flat. It's also faster than any row-by-row CSV generation I could write in straight Python.
Finally, it's not clear that you want 1 CSV, or 3 CSVs. Your final comment references a "combined CSV", so to do that, building on Michael's comment, try something like this:
sql = '''
copy (
    select *
    from your_table_name
    where month_id in (. . .)
    and Class in (. . .)
)
to stdout
with (
    format csv, header
)'''

stmt = db.cursor()  # a regular cursor: psycopg2 runs COPY only on client-side cursors
with open('output.csv', 'w') as outfile:
    stmt.copy_expert(sql, outfile)
If you do need 3 separate CSVs, you can modify the approach to do three separate passes.
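A sketch of that per-class variant, under the same assumptions (an open psycopg2 connection, here called `db`; the placeholder table name `your_table_name` from the snippet above; a helper `copy_sql_for_class` invented for illustration):

```python
def copy_sql_for_class(class_name, month_list):
    """Build the COPY ... TO STDOUT statement for a single class.

    Values are inlined for illustration only - with untrusted input,
    compose the statement with the psycopg2.sql module instead.
    """
    months = ", ".join(str(m) for m in month_list)
    return (
        "copy (select * from your_table_name "
        f"where month_id in ({months}) and class = '{class_name}') "
        "to stdout with (format csv, header)"
    )

def export_per_class(db, class_list, month_list):
    """One COPY pass per class, each streamed straight to its own CSV."""
    for class_name in class_list:
        with open(f"Class{class_name}.csv", "w") as outfile:
            with db.cursor() as curs:  # regular cursor, as COPY requires
                curs.copy_expert(copy_sql_for_class(class_name, month_list),
                                 outfile)
```

Each pass re-scans the table on the server, so an index on (month_id, class) would help a lot here.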
Hope that helps.