Tags: python, postgresql, cursor, psycopg2

Postgres Python server-side cursor won't insert all the data


I have a database with a million rows. I want to fetch all the rows, do some operation on them, and insert them into another table (newTable).

I figured out I need to use a server-side cursor, since I cannot fetch all the data into memory. I also figured out I need to use two connections, so that when I commit I don't lose the cursor I made.
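
In outline, that pattern looks like the sketch below (a minimal sketch; the credentials are placeholders, and my full code follows further down):

    import psycopg2

    # connection 1: a named cursor is a server-side cursor, so the result
    # set is streamed in chunks instead of loaded into memory all at once
    con_read = psycopg2.connect(database='XXXX', user='XXXX',
                                password='XXXX', host='localhost')
    cur = con_read.cursor(name="mycursor")
    cur.itersize = 2000  # rows fetched per network round trip when iterating

    # connection 2: does the inserts, so committing here does not end the
    # transaction that keeps "mycursor" open on connection 1
    con_write = psycopg2.connect(database='XXXX', user='XXXX',
                                 password='XXXX', host='localhost')

    cur.execute("SELECT id, oldgroups FROM oldTable")
    for row in cur:   # fetches itersize rows at a time from the server
        pass          # process row and insert via con_write here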

But now my problem is that not all the records end up in newTable, despite what the log shows. In the console log I see it trying to insert the 500,000th record into the database:

560530 inserting 20551581 and 2176511

But when I do a count on the new table (while the program is still running), it shows only about 10,000 rows:

select count(*) from newTable;
 count
-------
 10236

And when the program finishes, I only have about 11,000 records in the new table, while the log shows it tried to insert at least 2 million rows. What's wrong with my code?

    def fillMyTable(self):
        try:
            self.con = psycopg2.connect(database='XXXX', user='XXXX', password='XXXX', host='localhost')
            cur = self.con.cursor(name="mycursor")
            cur.arraysize = 1000
            cur.itersize = 2000

            self.con2 = psycopg2.connect(database='XXXX', user='XXXX', password='XXXX', host='localhost')
            cur2 = self.con2.cursor()

            q = "SELECT id, oldgroups FROM oldTable;"
            cur.execute(q)
            i = 0
            while True:
                batch = cur.fetchmany()
                if not batch:
                    break
                for row in batch:
                    userid = row[0]
                    groupids = self.doSomethingOnGroups(row[1])
                    for groupid in groupids:
                        # insert only if it does NOT exist
                        i += 1
                        print(str(i) + " inserting " + str(userid) + " and " + str(groupid))
                        q2 = "INSERT INTO newTable (userid, groupid) SELECT %s, %s WHERE NOT EXISTS (SELECT %s FROM newTable WHERE groupid = %s);" % (userid, groupid, userid, groupid)
                        cur2.execute(q2)
                self.con2.commit()
        except psycopg2.DatabaseError, e:
            self.writeLog(e)
        finally:
            cur.close()
            self.con2.commit()
            self.con.close()
            self.con2.close()

Update: I also noticed it uses a lot of my RAM. Isn't a server-side cursor supposed to prevent that?

    Cpu(s): 15.2%us,  6.4%sy,  0.0%ni, 56.5%id,  2.8%wa,  0.0%hi,  0.2%si, 18.9%st
    Mem:   1695220k total,  1680496k used,    14724k free,     3084k buffers
    Swap:        0k total,        0k used,        0k free,  1395020k cached


Solution

  • If the oldgroups column is in the form 1,3,6,7, this will work:

    insert into newTable (userid, groupid)
    select id, groupid
    from (
        select
            id,
            -- regexp_split_to_table returns text, so cast for the numeric comparisons
            regexp_split_to_table(oldgroups, ',')::int as groupid
        from oldTable
    ) o
    where
        not exists (
            select 1
            from newTable
            where groupid = o.groupid
        )
        and groupid < 10000000
    

    But I suspect you want to check for the existence of both groupid and id:

    insert into newTable (userid, groupid)
    select id, groupid
    from (
        select
            id,
            regexp_split_to_table(oldgroups, ',')::int as groupid
        from oldTable
    ) o
    where
        not exists (
            select 1
            from newTable
            where groupid = o.groupid and userid = o.id
        )
        and groupid < 10000000
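
    Because this statement does all the work inside the database, from Python it replaces the whole fetch/insert loop with a single execute and one commit. A minimal sketch, assuming the same connection parameters as in the question:

    import psycopg2

    con = psycopg2.connect(database='XXXX', user='XXXX',
                           password='XXXX', host='localhost')
    cur = con.cursor()
    cur.execute("""
        insert into newTable (userid, groupid)
        select id, groupid
        from (
            select id, regexp_split_to_table(oldgroups, ',')::int as groupid
            from oldTable
        ) o
        where
            not exists (
                select 1 from newTable
                where groupid = o.groupid and userid = o.id
            )
            and groupid < 10000000
    """)
    con.commit()  # one commit; no server-side cursor or batching needed
    con.close()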
    

    The regexp_split_to_table function will "explode" the oldgroups column into rows, effectively cross joining each resulting group value with its id.
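
    For illustration, here is that explosion on a small made-up sample (the VALUES rows are hypothetical, not from the question's data):

    select id, regexp_split_to_table(oldgroups, ',')::int as groupid
    from (values (1, '1,3'), (2, '6,7')) as t(id, oldgroups);

    --  id | groupid
    -- ----+---------
    --   1 |       1
    --   1 |       3
    --   2 |       6
    --   2 |       7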