I am trying to import data from a .csv file into an Oracle DB using Python. So far it works fine if the .csv file contains 10 records. If I increase the number of records in the .csv file to 1,000,000, the script takes far too long and does not finish even after an hour.
Can anyone tell me how I can optimise my source code?
Kind regards
Jegor
...
sql_insert = """Merge into TEST_TABLE a
Using (Select
:ID as ID,
:COUNTRY as COUNTRY ,
:DATE as DATE
From Dual) src
on src.ID = a.ID
when matched then
update set
a.COUNTRY = src.COUNTRY,
a.DATE = src.DATE
when not matched then
Insert (a.ID, a.COUNTRY, a.DATE)
Values (src.ID, src.COUNTRY, src.DATE)"""
# Get S3-File
obj = s3.Object(CDH_S3_Bucket, CDH_Path + '/' + s3_filename)
body = obj.get()['Body'].read().decode('utf-8').splitlines()
# ----------------------------------------------------------
csv_reader = csv.reader(body, delimiter=',')
headings = next(csv_reader)
data = []
for line in csv_reader:
    data.append(line)
if data:
    cursor.executemany(sql_insert, data)
    connection.commit()
cursor.close()
...
A merge is meant to modify one table based on the data in another table. It is not intended for single-row processing from the client like this. The proper design would be to use a normal bulk-bind insert to load a work table, and then do a single merge execution to sync the target table with the work table.
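A minimal sketch of that pattern, assuming a work table TEST_TABLE_STG with the same ID/COUNTRY/DATE columns already exists (the staging table name is a placeholder, and "DATE" is quoted because DATE is a reserved word in Oracle - substitute your real column names):

    # Clear the work table first (note: TRUNCATE is DDL and commits implicitly).
    cursor.execute("TRUNCATE TABLE TEST_TABLE_STG")

    # Bulk-bind insert of all CSV rows into the work table - same executemany
    # call you already use, but against a plain INSERT instead of a MERGE.
    sql_stage = """INSERT INTO TEST_TABLE_STG (ID, COUNTRY, "DATE")
                   VALUES (:1, :2, :3)"""
    cursor.executemany(sql_stage, data)

    # One set-based merge from the work table into the target table.
    sql_merge = """MERGE INTO TEST_TABLE a
                   USING TEST_TABLE_STG src
                   ON (src.ID = a.ID)
                   WHEN MATCHED THEN UPDATE SET
                       a.COUNTRY = src.COUNTRY,
                       a."DATE"  = src."DATE"
                   WHEN NOT MATCHED THEN
                       INSERT (a.ID, a.COUNTRY, a."DATE")
                       VALUES (src.ID, src.COUNTRY, src."DATE")"""
    cursor.execute(sql_merge)
    connection.commit()

This way the client sends one bulk insert and the server runs the merge exactly once, instead of executing the merge statement a million times.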
Also, when you do use a merge (appropriately), you don't want to use the where or delete subclauses within the merge's when matched then update ... clause. For merges, the where clause is different than in other SQL statements: it is processed as part of the update operation itself, which means you already pay the penalty of CR block reads and other concurrency mechanisms, including redo, even if the where clause ends up cancelling the update of a row. You want to filter out unchanged rows within the using clause instead, which means pre-joining to the target table within the using clause and filtering out unchanged rows there - you can then emit the target table's ROWID for optimal matching in the outer merge block, but the main benefit is filtering before the DML operation on the target.
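A sketch of that shape, reusing the hypothetical TEST_TABLE_STG work table from above (the exact change-detection conditions depend on your columns):

    sql_merge = """MERGE INTO TEST_TABLE a
                   USING (SELECT s.ID, s.COUNTRY, s."DATE", t.ROWID AS target_rowid
                            FROM TEST_TABLE_STG s
                            LEFT JOIN TEST_TABLE t ON t.ID = s.ID
                           WHERE t.ID IS NULL                -- brand-new rows
                              OR t.COUNTRY <> s.COUNTRY      -- changed rows only
                              OR t."DATE"  <> s."DATE"       -- (add NULL-aware checks if these columns can be NULL)
                         ) src
                   ON (a.ROWID = src.target_rowid)
                   WHEN MATCHED THEN UPDATE SET
                       a.COUNTRY = src.COUNTRY,
                       a."DATE"  = src."DATE"
                   WHEN NOT MATCHED THEN
                       INSERT (a.ID, a.COUNTRY, a."DATE")
                       VALUES (src.ID, src.COUNTRY, src."DATE")"""

Rows that are already identical never reach the update at all, and the ROWID carried out of the using clause lets the outer merge match back to the target without repeating the key lookup.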
The delete subclause also fires after the update is processed, and only on the updated row - if a row is being updated, it's because it's arriving in your new data, so by definition it isn't old, unless you are getting old data in your files. To delete old records you have to use a totally separate archiving SQL that is not the same as your loading SQL.