Tags: database, search, parallel-processing, large-data

Processing large amounts of data quickly


I'm working on a web application where the user provides parameters, and these are used to produce a list of the top 1000 items from a database of up to 20 million rows. I need all top 1000 items at once, and I need this ranking to happen more or less instantaneously from the perspective of the user.

Currently, I'm using MySQL with a user-defined function to score and rank the data, then PHP takes it from there. Tested on a database of 1M rows, this takes about 8 seconds, but I need performance around 2 seconds, even for a database of up to 20M rows. Preferably, this number should be lower still, so that decent throughput is guaranteed for up to 50 simultaneous users.

I am open to any process with any software that can process this data as efficiently as possible, whether it is MySQL or not. Here are the features and constraints of the process:

  • The data relevant to the scoring process is about 50 bytes per row.
  • Inserts and updates to the DB are negligible.
  • Each score is independent of the others, so scores can be computed in parallel.
  • Due to the large number of parameters and parameter values, the scores cannot be pre-computed.
  • The method should scale well for multiple simultaneous users.
  • The fewer computing resources this requires, in terms of number of servers, the better.

Thanks


Solution

  • A feasible approach is to load (and later update) all the data into about 1 GB of RAM and perform the scoring and ranking outside MySQL, in a language like C++. That should be faster than MySQL.

    The scoring must be relatively simple for this approach, because your requirements leave only about a tenth of a microsecond per row (2 s ÷ 20M rows = 100 ns) for scoring and ranking without parallelization or optimization.
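
    To make this concrete, here is a minimal sketch of that in-memory approach in C++: score every row, then keep the top 1000 with std::partial_sort. The Row layout, the weight-vector parameters, and the weighted-sum scoring function are all assumptions for illustration; your real schema and scoring logic would replace them.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <utility>
    #include <vector>

    // Hypothetical ~50-byte row holding only the fields needed for scoring.
    // The layout is an illustrative stand-in for the real schema.
    struct Row {
        int32_t id;
        float   features[11];  // 44 bytes of scoring-relevant data
        int16_t category;
    };

    // User-supplied query parameters (assumed here to be per-feature weights).
    struct Params {
        float weights[11];
    };

    // Each score depends only on its own row, so this is trivially
    // parallelizable (e.g. split the vector across threads or use OpenMP).
    static float score(const Row& r, const Params& p) {
        float s = 0.0f;
        for (int i = 0; i < 11; ++i) s += r.features[i] * p.weights[i];
        return s;
    }

    int main() {
        // Demo size; production would keep all ~20M rows (~1 GB) resident.
        const size_t N = 1'000'000;
        std::vector<Row> rows(N);

        // Fill with synthetic data; in practice, load once from the DB at startup.
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);
        for (size_t i = 0; i < N; ++i) {
            rows[i].id = static_cast<int32_t>(i);
            for (float& f : rows[i].features) f = dist(rng);
        }

        Params p;
        for (float& w : p.weights) w = dist(rng);  // stand-in for user input

        // Score every row.
        std::vector<std::pair<float, int32_t>> scored(N);
        for (size_t i = 0; i < N; ++i)
            scored[i] = {score(rows[i], p), rows[i].id};

        // Keep only the top 1000: partial_sort is O(n log k), much cheaper
        // than fully sorting all rows.
        const size_t k = 1000;
        std::partial_sort(scored.begin(), scored.begin() + k, scored.end(),
                          [](const auto& a, const auto& b) { return a.first > b.first; });

        // scored[0..k) now holds the top 1000 (score, id) pairs.
        std::printf("best id=%d score=%f\n", scored[0].second, scored[0].first);
        return 0;
    }
    ```

    Because each score is independent, the scoring loop parallelizes cleanly: each thread can score a slice of the vector, keep its own local top 1000, and the per-thread results can be merged at the end. That should help both with the 2-second latency target and with throughput for 50 simultaneous users.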