python, oracle-database

How to import multiple records with MERGE function to an oracle DB via Python


I am trying to import data from a .csv file into an Oracle DB using Python. So far it works fine if the .csv file contains 10 records. If I increase the number of records in the .csv file to 1,000,000, the script takes far too long and does not finish even after an hour.

Can anyone tell me how I can optimise my source code?

Kind regards

Jegor

...
sql_insert = """Merge into TEST_TABLE a 
                Using (Select
                           :ID as ID,
                           :COUNTRY as COUNTRY ,
                           :DATE as DATE
                      From Dual) src
                on src.ID = a.ID
                when matched then
                      update set
                           a.COUNTRY = src.COUNTRY,
                           a.DATE = src.DATE
                when not matched then
                       Insert (a.ID, a.COUNTRY, a.DATE)
                       Values (src.ID, src.COUNTRY, src.DATE)"""

# Get S3-File
obj = s3.Object(CDH_S3_Bucket, CDH_Path + '/' + s3_filename)
body = obj.get()['Body'].read().decode('utf-8').splitlines()

# ----------------------------------------------------------
csv_reader = csv.reader(body, delimiter=',')
headings = next(csv_reader)  # skip the header row
data = list(csv_reader)

if data:
    cursor.executemany(sql_insert, data)
    connection.commit()

cursor.close()
...

Solution

  • A merge is meant to modify one table based on the data in another table; it is not intended for row-by-row processing driven from the client like this. The proper design is to use a normal bulk-bind insert to load a work table, after which a single merge execution can sync the target table with the work table.
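A minimal sketch of that design, assuming the python-oracledb (cx_Oracle) driver, an open `connection`, and a staging table `TEST_TABLE_STG` with the same columns as `TEST_TABLE` (the staging table name is illustrative, and the `DATE` column is renamed `DT` here because `DATE` is a reserved word in Oracle; keep your real column names):

```python
def load_and_merge(connection, rows, batch_size=50_000):
    """Bulk-insert rows into the staging table, then run ONE set-based MERGE."""
    with connection.cursor() as cursor:
        # Start from an empty work table for this load
        cursor.execute("TRUNCATE TABLE TEST_TABLE_STG")

        # executemany() bind-arrays each batch in a single round trip,
        # instead of one network round trip per row
        insert_sql = "INSERT INTO TEST_TABLE_STG (ID, COUNTRY, DT) VALUES (:1, :2, :3)"
        for start in range(0, len(rows), batch_size):
            cursor.executemany(insert_sql, rows[start:start + batch_size])

        # One server-side MERGE syncs the target from the work table
        cursor.execute("""
            MERGE INTO TEST_TABLE a
            USING TEST_TABLE_STG src
               ON (src.ID = a.ID)
            WHEN MATCHED THEN UPDATE SET
                 a.COUNTRY = src.COUNTRY,
                 a.DT = src.DT
            WHEN NOT MATCHED THEN
                 INSERT (a.ID, a.COUNTRY, a.DT)
                 VALUES (src.ID, src.COUNTRY, src.DT)
        """)
        connection.commit()
```

The batching keeps client memory bounded for very large files; 10,000 to 100,000 rows per `executemany()` call is a common range to try.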

    Also, when you do use a merge (appropriately), avoid the where and delete subclauses of the when matched then update ... branch. In a merge, the where clause works differently than in other SQL statements: it is evaluated as part of the update step, so you still pay for consistent-read block gets and other concurrency mechanisms, including redo, even when the where clause ends up cancelling the update of a row. Instead, filter out unchanged rows inside the using clause: pre-join to the target table there, discard rows whose values have not changed, and optionally emit the target table's ROWID for optimal matching in the outer merge block. The main benefit is that the filtering happens before the DML operation touches the target.
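An illustrative version of such a merge, again assuming a staging table `TEST_TABLE_STG` and using `DT` in place of the reserved word `DATE` (note that the `<>` comparisons treat NULLs as "unchanged"; wrap the columns in NVL/COALESCE if your data contains NULLs that matter):

```python
# USING-clause pre-filter: the subquery joins staging to target, keeps only
# new or changed rows, and carries the target's ROWID out for the join.
merge_sql = """
    MERGE INTO TEST_TABLE a
    USING (SELECT src.ID, src.COUNTRY, src.DT, tgt.ROWID AS target_rowid
             FROM TEST_TABLE_STG src
             LEFT JOIN TEST_TABLE tgt ON tgt.ID = src.ID
            WHERE tgt.ID IS NULL                 -- new row
               OR tgt.COUNTRY <> src.COUNTRY     -- or changed row
               OR tgt.DT <> src.DT) u
       ON (a.ROWID = u.target_rowid)
    WHEN MATCHED THEN UPDATE SET
         a.COUNTRY = u.COUNTRY,
         a.DT = u.DT
    WHEN NOT MATCHED THEN
         INSERT (a.ID, a.COUNTRY, a.DT)
         VALUES (u.ID, u.COUNTRY, u.DT)
"""
```

Unchanged rows never reach the update step, so they generate no redo and take no row locks.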

    The delete subclause also fires only after the update is processed, and only on the updated row. If a row is being updated, it is present in your incoming data, so by definition it is not old, unless your files themselves contain old data. To delete old records, use a completely separate archiving SQL statement; it should not be part of your loading SQL.
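As a hypothetical sketch of such a separate cleanup step (run apart from the load, and only if "no longer in the incoming file" really means "stale" for your data):

```python
# Illustrative archiving statement: remove target rows that have no
# counterpart in the staging table from the latest load.
delete_sql = """
    DELETE FROM TEST_TABLE a
     WHERE NOT EXISTS (SELECT 1
                         FROM TEST_TABLE_STG src
                        WHERE src.ID = a.ID)
"""
```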