Search code examples
python-3.xpostgresqllistpsycopg2server-side

Psycopg2 - How to Do Large Query - 1 Million Plus Rows


So i'm trying to do a massive query from a postgresql database that I have locally.

The 'processrecords' function, returns a listofmalwareobjects, i'm assuming i'll need to add a new list to a master list every time the server side cursor is run.

How would I do this properly as i'm a little confused.

I think I need to use server side cursors as otherwise the program runs out of memory with Psycopg2. But i've heard good things about server side cursors.

with connection:

    cursor = connection.cursor()
    with cursor:
        cursor.itersize = 20000
        cursor.execute("SELECT malware_id, malwarehashmd5, malwarehashsha1, malwarehashsha256g FROM malwarehashesandstrings")
        listoffetchedmalware = cursor.fetchall()


        listofmalwareobjects = processrecords(listoffetchedmalware)

Solution

  • For the documentation:

    Psycopg wraps the database server side cursor in named cursors. A named cursor is created using the cursor() method specifying the name parameter.

    Try:

    with connection:
    
        cursor = connection.cursor('my_cursor')
        ...
    

    However, fetchall() will still return all rows at once. If you want to process the data in buckets, use fetchmany() in a loop, e.g.

    with connection.cursor(name="my_cursor") as cursor:
    
        cursor.itersize = 20000
        cursor.execute(the_query)
        
        listoffetchedmalware = cursor.fetchmany(cursor.itersize)
        
        while len(listoffetchedmalware) > 0:
        
            listofmalwareobjects = processrecords(listoffetchedmalware)
            listoffetchedmalware = cursor.fetchmany(cursor.itersize)