Tags: python, postgresql, cursor, bigdata, psycopg2

Reading huge data using server-side cursors with conditions in Python


I have a huge table (~0.8 billion rows) and I need to fetch data based on some segment conditions.

Data:

d_id    month_id    sec     average Class
89      201701      S       5.98    A
73      201703      N       7.63    B
31      201708      F       6.38    P
11      201709      K       6.38    P

I have two lists:

monthList = [201701,201702,201703]

so the SQL query is:

sql_query = str("""select * from dbo.table_name where month_id IN monthList;""") 

Now I want to hold this data in a server-side cursor and from that fetch a subset based on a classList:

curs = cnxn.cursor('Class')
classList = ['A','B','P']

while True:
    records = curs.fetchmany(int(1e3))
    if not records:
        break
    for record in records:
        # here I want to use the classList to subset the data, something like
        df = pd.DataFrame()
        df.append(curs.fetchmany([cursor['Class'] == classList]))

        # And once all the records of each Class have been read, create a csv
        df.to_csv("ClassList.csv")

So for the data given above, there will be 3 CSVs generated, e.g. 1. ClassA.csv

d_id    month_id    sec     average Class
89      201701      S       5.98    A

All the data is in PostgreSQL, which I am querying using psycopg2.

Can someone please help me with the following:

1. Is this even possible to do with server-side cursors?
2. I basically need to create a combined CSV for each Class from all the data, also filtered on the month_id values given as a list.


Solution

  • That's not quite how server-side cursors work - they make the server hold state while the client traverses the result set, fetching in batches and possibly reversing the traversal. The benefit is that the server maintains state about the connection so that the client can manage memory more efficiently (by default, the client tries to fetch everything before letting your code iterate; for ~0.8 billion rows, that can cause problems).

    But the key thing to remember is that the data returned by the cursor is determined by the query - you can inspect each row to decide what to do with it, but you are still operating row by row, not changing the result set returned by the server. And your DBA may come after you with violent intent if you abuse the server: holding a server-side cursor over ~0.8 billion rows while making multiple traversals would put a lot of memory pressure on the database, slowing things down for other users.
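
    To make that concrete, the row-by-row version of what you are describing would
    look roughly like this with a named cursor - treat it as a sketch, since the
    connection details, table name, and column order are assumptions based on your
    sample data:

    import csv
    import psycopg2

    cnxn = psycopg2.connect("dbname=mydb user=me")   # assumed connection details

    monthList = [201701, 201702, 201703]
    classList = ['A', 'B', 'P']

    # A named cursor makes the server hold the result set; the client pulls it
    # down itersize rows at a time instead of loading everything up front.
    curs = cnxn.cursor(name='month_cursor')
    curs.itersize = 1000
    curs.execute(
        "select * from your_table_name where month_id in %s",
        (tuple(monthList),),
    )

    # One output file per class: ClassA.csv, ClassB.csv, ClassP.csv
    files = {c: open('Class{}.csv'.format(c), 'w', newline='') for c in classList}
    writers = {c: csv.writer(f) for c, f in files.items()}
    for w in writers.values():
        w.writerow(['d_id', 'month_id', 'sec', 'average', 'Class'])

    for row in curs:              # fetched from the server in batches
        cls = row[4]              # assumes Class is the last column, as in the sample
        if cls in classList:      # the filtering still happens client-side, row by row
            writers[cls].writerow(row)

    for f in files.values():
        f.close()
    curs.close()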

    And the same is true of local system memory via Pandas - based on your example, you aren't really doing anything other than using it to generate the CSV.

    What to do?

    If you just need to write a large, combined CSV, using psycopg2's native copy_expert feature to stream the result directly into a CSV file is the way to go. A COPY ... TO STDOUT statement already streams the rows to the client in chunks, so it keeps memory flat without a named cursor (psycopg2's COPY methods can't be run on a named cursor anyway).

    I frequently use this approach to create CSVs from large data sets - while keeping both database and client memory flat. It's also faster than any row-by-row CSV generation I could write in straight Python.

    Finally, it's not clear whether you want 1 CSV or 3 CSVs. Your final comment references a "combined CSV", so to do that, building on Michael's comment, try something like this:

    sql = '''
    copy (
        select * 
          from your_table_name 
         where month_id in (. . .) 
           and Class in (. . .)
    )
    to stdout
    with (
      format csv, header
    )'''
    
    stmt = db.cursor()   # a plain (unnamed) cursor - psycopg2's COPY methods can't
                         # be used on a named cursor, and COPY ... TO STDOUT already
                         # streams the result, so memory stays flat either way
    with open('output.csv', 'w') as outfile:
        stmt.copy_expert(sql, outfile)
    

    If you do need 3 separate CSVs, you can modify the approach to do three separate passes.
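
    For instance, a sketch of that three-pass variant, reusing the db connection
    and table name from the snippet above (the month list and class values come
    from your example; mogrify() is used to build the literal values because a
    COPY statement can't take bound parameters):

    monthList = [201701, 201702, 201703]
    cur = db.cursor()

    for cls in ['A', 'B', 'P']:
        copy_sql = cur.mogrify('''
            copy (
                select *
                  from your_table_name
                 where month_id in %s
                   and Class = %s
            )
            to stdout
            with (
              format csv, header
            )''', (tuple(monthList), cls)).decode()

        with open('Class{}.csv'.format(cls), 'w') as outfile:
            cur.copy_expert(copy_sql, outfile)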

    Hope that helps.
