I'm currently working on a Ruby/Rails project that imports invoices into the database, and I'm trying to maximise the efficiency of the process, which is far too slow right now.
For an import batch of 100,000 rows it takes around 2.5 to 3 hours to process and save all the records in the database.
//// Ruby code
class DeleteImportStrategy
  def pre_process(merchant_prefix, channel_import)
    # The channel is needed to identify invoices, so an import from another
    # channel cannot collide if it has the same merchant_prefix.
    Jzbackend::Invoice.where(merchant_prefix: merchant_prefix, channel: channel_import.channel).delete_all
    # Get rid of all previous import batches, which become empty after delete_import_strategy.
    Jzbackend::Import.where.not(id: channel_import.id).where(channel: channel_import.channel).destroy_all
  end

  def process_row(row, channel_import)
    debt_claim = Jzbackend::Invoice.new
    debt_claim.import = channel_import
    debt_claim.status = 'pending'
    debt_claim.channel = channel_import.channel
    debt_claim.merchant_prefix = row[0]
    debt_claim.debt_claim_number = row[1]
    debt_claim.amount = Monetize.parse(row[2])
    debt_claim.print_date = row[3]
    debt_claim.first_name = row.try(:[], 4)
    debt_claim.last_name = row.try(:[], 5)
    debt_claim.address = row.try(:[], 6)
    debt_claim.postal_code = row.try(:[], 7)
    debt_claim.city = row.try(:[], 8)
    debt_claim.save
  end
end
////
So for each import batch that comes in as CSV, I get rid of the previous batches and start importing the new one by reading each row and inserting it into the new Import as Invoice records. However, 2.5-3 hours for 100,000 entries seems excessive. How can I optimise this process? I'm sure it's not efficient this way.
Edit: It has been a long time since I posted this, but just to note, I ended up using the activerecord-import library, which has worked well since then. However, note that its :on_duplicate_key_update functionality is only available in PostgreSQL 9.5+.
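For reference, a minimal sketch of what the batched version could look like with activerecord-import. The column mapping mirrors process_row above, but the method name, batch size, and validation options are assumptions rather than my exact production code:
//// Ruby code
# Requires gem 'activerecord-import' in the Gemfile.
# Sketch only: build Invoice objects in memory, then let activerecord-import
# write them with multi-row INSERT statements instead of one INSERT per record.
def import_rows(rows, channel_import)
  invoices = rows.map do |row|
    Jzbackend::Invoice.new(
      import: channel_import,
      status: 'pending',
      channel: channel_import.channel,
      merchant_prefix: row[0],
      debt_claim_number: row[1],
      amount: Monetize.parse(row[2]),
      print_date: row[3],
      first_name: row[4],
      last_name: row[5],
      address: row[6],
      postal_code: row[7],
      city: row[8]
    )
  end

  # batch_size of 1000 is an assumption; validate: false skips per-record
  # ActiveRecord validations for speed. The gem also exposes bulk_import
  # as an alias if `import` clashes with anything else in the app.
  Jzbackend::Invoice.import(invoices, batch_size: 1_000, validate: false)
end
////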
First rule of mass imports: batch, batch, batch.
You're saving each row separately, which incurs HUGE overhead. Say the insert itself takes 1 ms, but the round trip to the database takes 5 ms. Total time used: 6 ms per record. For 1000 records that's 6000 ms, or 6 seconds.
Now imagine that you use a mass insert, where you send data for multiple rows in the same statement. It looks like this:
//// SQL
INSERT INTO users (name, age)
VALUES ('Joe', 20), ('Moe', 22), ('Bob', 33), ...
////
Let's say you send data for 1000 rows in this one request. The inserts themselves take 1000 ms (in reality it will likely be considerably quicker, since there's less overhead on parsing the query, preparing the execution plan, etc.), plus a single 5 ms round trip. Total time taken: 1005 ms instead of 6000 ms. That's at least a 6x reduction (in real projects of mine, I was observing 100x-200x reductions).
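Applied to the code in your question, a rough sketch might look like the following. This assumes Rails 6+ for insert_all; note that insert_all bypasses ActiveRecord callbacks, validations, and gem-provided type casting such as Monetize, so the amount column mapping here is only an assumption about your schema:
//// Ruby code
# Sketch only: flush rows to the database in slices of 1000, so each round
# trip carries a whole batch instead of a single record.
BATCH_SIZE = 1_000

rows.each_slice(BATCH_SIZE) do |slice|
  Jzbackend::Invoice.insert_all(
    slice.map do |row|
      {
        import_id:         channel_import.id,
        status:            'pending',
        channel:           channel_import.channel,
        merchant_prefix:   row[0],
        debt_claim_number: row[1],
        # insert_all writes raw column values, so money parsing has to be
        # done by hand; the cents column name is an assumption.
        amount_cents:      Monetize.parse(row[2]).cents,
        print_date:        row[3],
        first_name:        row[4],
        last_name:         row[5],
        address:           row[6],
        postal_code:       row[7],
        city:              row[8]
        # insert_all does not fill created_at/updated_at on Rails 6, so add
        # them here if the table requires them.
      }
    end
  )
end
////
On older Rails versions, the activerecord-import gem mentioned in the question's edit gives the same multi-row INSERT behaviour.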