Tags: mysql, node.js, sequelize.js, query-optimization

Is there a better way to compare a huge dataset supplied by the user to the entries in the database?


I have 5 million entries in a MySQL database with the following structure

id (primary_key), name, phone_number
1, John Doe, 12346789

.....

My tool scrapes the internet continuously and captures millions of new records. The scraped data is then passed to the processDataInChunks function, 10000 entries at a time.

I am trying to compare the scraped data to the rows in the database and insert an entry if it doesn't exist. I have implemented this with Sequelize in Node.js; here's the code:

// data = chunk of data collected by the scraper
// callback = function to call when new entry is detected
function processDataInChunks(data, callback) {
   data.map(function (entry) { // loop through the data array where entry is nth element of the array
      db.findAll({ where: { phone_number: entry.phone_number } }) // db is the Sequelize model for the table
         .then(function (rows) { // called with the matching rows when the query succeeds
            if (rows.length === 0) { // if phone number is not in the database
               db.create({ // create entry in the database
                  name: entry.name,
                  phone_number: entry.phone_number
               }).then(function () {
                  callback(entry);
                  console.log(`Found a new phone number: ${entry.phone_number}`)
               }).catch(err=>console.log(err))

            }
         }).catch(err=>console.log(err))
   })
}

While running the code I'm getting a ConnectionAcquireTimeoutError. I assume this happens because all of the pool's connections are in use and Sequelize has none left to hand out for new queries. What is the best and fastest way to perform this operation? Please help.

I have tried using async/await, but it still takes ages and I don't think it will ever complete.


Solution

  • You seem to be issuing at least two SQL statements per incoming row. By really batching, you can get a 10x speedup.

    Use INSERT INTO ... ON DUPLICATE KEY UPDATE ... (aka IODKU, or upsert). It avoids having to do the initial SELECT; that's a speedup of about 2x.

    Batch them in clumps of 1000 -- 10000 might be slightly faster, but may run into other issues.
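
    As a rough sketch of how that could look with Sequelize -- the Contact model here is an assumption (defined in a later sketch), with a UNIQUE index on phone_number -- bulkCreate with the updateOnDuplicate option issues one multi-row INSERT ... ON DUPLICATE KEY UPDATE per clump:

    // Sketch only: `Contact` is a hypothetical Sequelize model for the table above.
    async function upsertInClumps(data) {
       const CLUMP_SIZE = 1000; // clumps of 1000, as suggested above
       for (let i = 0; i < data.length; i += CLUMP_SIZE) {
          const clump = data.slice(i, i + CLUMP_SIZE);
          // One multi-row INSERT ... ON DUPLICATE KEY UPDATE per clump,
          // awaited so only one pool connection is busy at a time.
          await Contact.bulkCreate(clump, {
             fields: ['name', 'phone_number'],
             updateOnDuplicate: ['name'] // on a duplicate phone_number, just refresh the name
          });
       }
    }

    (Reporting which rows were actually new, as the original callback does, is not something this sketch covers.)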

    Are most of the entries unchanged? Or is everything you get going to be either a new entry or an update? There may be some extra optimizations if the data tends to be one way versus the other. (IODKU is happy to handle all 3 cases.)

    Does the data provide the id? Is it consistent? Or is the name the actual clue of which row to update? In that case, what index do you have? Can there be two different people with the same name? If so, how would you differentiate their rows?
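
    Whichever column it is, IODKU needs a UNIQUE index on it so MySQL can recognize an existing row. Since the question's code keys on phone_number, a hypothetical model definition for the sketches in this answer could look like:

    const { Sequelize, DataTypes } = require('sequelize');
    const sequelize = new Sequelize(/* connection settings */);

    // The UNIQUE index on phone_number is what ON DUPLICATE KEY UPDATE keys on.
    const Contact = sequelize.define('Contact', {
       name: DataTypes.STRING,
       phone_number: { type: DataTypes.STRING, unique: true }
    }, { tableName: 'contacts', timestamps: false });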

    You could feed the chunks to multiple threads. This would provide some parallelism. Stop at about the number of CPU cores. And do keep the chunk size down at 1000; 10000 is likely to have locking issues, maybe even deadlocks. And do check for errors.
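
    A minimal sketch of that kind of bounded concurrency, assuming the data is already split into clumps of 1000 and the Sequelize pool is at least as large as the concurrency:

    const os = require('os');

    // Process a few clumps at a time, capped at roughly the CPU core count.
    async function upsertInParallel(clumps) {
       const CONCURRENCY = Math.min(os.cpus().length, clumps.length);
       for (let i = 0; i < clumps.length; i += CONCURRENCY) {
          const batch = clumps.slice(i, i + CONCURRENCY);
          // Promise.all rejects on the first failure, so errors are not silently swallowed.
          await Promise.all(batch.map(clump =>
             Contact.bulkCreate(clump, { updateOnDuplicate: ['name'] })
          ));
       }
    }

    (In Node.js the parallelism comes from overlapping queries on separate pool connections rather than actual threads; MySQL does the heavy lifting.)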

    It is possible to write a single IODKU that handles a thousand rows. Or you could throw all the data into a temp table and work from there.
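
    For reference, a sketch of a raw multi-row IODKU for one clump, issued through sequelize.query (the table name contacts is an assumption); this is roughly the statement bulkCreate builds for MySQL:

    // `rows` is one clump: an array of { name, phone_number } objects.
    async function upsertClumpRaw(sequelize, rows) {
       const placeholders = rows.map(() => '(?, ?)').join(', ');
       const values = rows.flatMap(r => [r.name, r.phone_number]);
       await sequelize.query(
          `INSERT INTO contacts (name, phone_number) VALUES ${placeholders}
           ON DUPLICATE KEY UPDATE name = VALUES(name)`,
          { replacements: values }
       );
    }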

    If you receive only one row at a time, please spell out the details; an extra step will be needed to collect the data.

    Summary --

    • 10x for batching statements
    • 2x for IODKU
    • 5x(?) for parallelism

    Total might be 100x. Does that sound better?