Tags: algorithm, producer-consumer, greedy, knapsack-problem

Searching an algorithm similar to producer-consumer


I would like to ask if someone has an idea of the best (fastest) algorithm for the following scenario:

  • X processes generate a list of very large files. Each process generates one file at a time
  • Y processes are being notified that a file is ready. Each Y process has its own queue to collect the notifications
  • At any given time, one X process notifies one Y process through a Load Balancer that uses the Round Robin algorithm
  • Each file has a size and, naturally, bigger files keep both X and Y busier

Limitations

  • Once a file gets on a Y process it would be impractical to remove it and move it to another Y process.

I can't think of other limitations at the moment.

Disadvantages to this approach

  • sometimes X falls behind (files are no longer pushed). This is not really caused by the queueing system; no matter what I change, X will still have slow and fast periods.
  • sometimes Y falls behind (a lot of files gather in the queues). Same as above.
  • one Y process can be busy with a very large file while several small files sit in its queue that other Y processes could take on.
  • the notification itself goes over HTTP and sometimes seems unreliable. Notifications fail and debugging has not revealed why.

There are some more details that would help to see the picture more clearly.

  • Y processes are DB threads/jobs
  • X processes are web apps
  • While producing files, the X processes also burn resources on the DB side by querying it, so DB load has an impact on the producing part

Now I considered the following approach:

  • X will produce files as before but will not notify Y. It will maintain a buffer (table) holding the list of ready files
  • Y will constantly search the buffer for files, retrieve them itself, and store them in its own queue.
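The pull-based approach above can be sketched roughly as follows. This is a minimal single-process simulation (the names `ready_files`, `producer`, and `consumer` are illustrative, not part of any real system); the buffer table is stood in for by a lock-protected list.

```python
import threading

# Sketch of the pull model: X processes append "file ready" rows to a
# shared buffer, and each Y process claims work itself instead of
# being pushed notifications.

buffer_lock = threading.Lock()
ready_files = []  # stands in for the buffer table: (name, size) rows

def producer(names_and_sizes):
    # An X process: publish each finished file to the buffer.
    for name, size in names_and_sizes:
        with buffer_lock:
            ready_files.append((name, size))

def consumer(results):
    # A Y process: repeatedly claim one file from the buffer.
    while True:
        with buffer_lock:
            if not ready_files:
                break  # a real consumer would sleep and poll again
            name, size = ready_files.pop(0)
        results.append(name)  # "process" the file

producer([("a.dat", 5), ("b.dat", 1), ("c.dat", 3)])
done = []
consumer(done)
print(done)  # all three files claimed, in arrival order
```

The key point is that the lock (in a real DB, a row-level lock or an atomic claim-update) ensures two Y processes never claim the same file.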

Now, would this change be practical? As I said, each Y process has its own queue; it no longer seems efficient to keep it. If so, then I'm still undecided on the next bit:

How to decide which files to fetch

I've read through the knapsack problem and I think it would apply if I had the entire list of files from the beginning, which I don't. Actually, I do have the list and the size of each file, but I don't know when each file will be ready to be taken.

I've gone through the producer-consumer problem, but it centers on a fixed buffer and optimising its use, whereas in this scenario the buffer is unlimited and I don't really care whether it is large or small.

The next best option would be a greedy approach where each Y process locks the buffer and takes the smallest file. At first glance it appears to be the fastest approach, and I'm currently building a simulation to verify that, but a second opinion would be fantastic.
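One way to sketch that greedy rule (names here are illustrative): keep the ready files in a min-heap keyed by size, so an idle Y process always claims the smallest available file in O(log n).

```python
import heapq

# Sketch of the greedy rule: a min-heap ordered by file size, so an
# idle consumer always claims the smallest ready file.

ready = []  # heap of (size, name) tuples

def publish(name, size):
    # A producer announces a finished file.
    heapq.heappush(ready, (size, name))

def claim_smallest():
    # An idle consumer claims the smallest ready file.
    size, name = heapq.heappop(ready)
    return name

publish("big.dat", 900)
publish("small.dat", 10)
publish("mid.dat", 120)

order = [claim_smallest() for _ in range(3)]
print(order)  # ['small.dat', 'mid.dat', 'big.dat']
```

Worth noting: smallest-first (shortest-job-first) minimizes average waiting time, but it can starve the largest files indefinitely if small ones keep arriving, so a simulation should check worst-case latency as well as throughput.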

Update: Just to be sure everyone gets the big picture, I'm linking a quickly drawn diagram here.

  • Jobs are independent of Processes. They run at their own speed and process as many files as they can.
  • When a Job finishes with a file, it sends an HTTP request to the LB
  • Each process queues requests (files) coming from the LB
  • The LB works on a round robin rule

Diagram


Solution

  • The current LB idea is not good

    The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.

    Just have consumers notify the LB when they're idle. When a new job arrives from a producer, the LB selects the first idle consumer and sends the job there. When no consumers are idle, the job is queued in the LB until a free consumer appears.

    This way consumers will always be optimally busy.
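    A minimal sketch of that dispatch logic (the class and method names are made up for illustration): the LB holds a FIFO of idle consumers and a FIFO of pending jobs, and whichever side arrives second gets matched immediately.

```python
from collections import deque

# Sketch of the suggested LB: consumers report in when idle; jobs go
# to the first idle consumer, otherwise they queue inside the LB.

class LoadBalancer:
    def __init__(self):
        self.idle = deque()      # consumers waiting for work (FIFO)
        self.pending = deque()   # jobs waiting for a consumer
        self.assigned = []       # (job, consumer) log, for illustration

    def consumer_idle(self, consumer):
        # A consumer announces it is free.
        if self.pending:
            self.assigned.append((self.pending.popleft(), consumer))
        else:
            self.idle.append(consumer)

    def submit(self, job):
        # A producer hands a job to the LB.
        if self.idle:
            self.assigned.append((job, self.idle.popleft()))
        else:
            self.pending.append(job)

lb = LoadBalancer()
lb.consumer_idle("y1")
lb.submit("file-a")     # y1 is idle: dispatched immediately
lb.submit("file-b")     # no idle consumer: queued in the LB
lb.consumer_idle("y2")  # y2 frees up and picks up file-b
print(lb.assigned)      # [('file-a', 'y1'), ('file-b', 'y2')]
```

    No consumer ever sits idle while work is queued, which is exactly the property round-robin fails to guarantee with variable job lengths.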

    You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. It only needs to be 100 times faster (since you imply there are 100 consumers) than the average "very large file" handling operation. File handling normally takes tenths of a second or seconds. A queue handler based, say, on an Apache mod or Redis could quite easily serve 10,000 requests per second. That is a factor of 10 away from being a bottleneck.

    If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.

    If the LB absolutely cannot queue work

    Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to keep the Ty(t) values equal across all y at every t. This is the ideal.

    To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, it finds the consumer y with the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.

    The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.

    A standard approach (e.g. from OS scheduling) is to maintain a pair [t, T]_y for each consumer y, where T is the estimate of Ty(t) computed at the past epoch t. At a later epoch t+d, we can then estimate Ty(t+d) as max(T-d, 0). The max is there because for d>T the estimated job time has expired, so the consumer should have finished its queued work.

    The LB uses whatever information it can get to update the model. Examples:

      • estimates of the time a job will require (from your description, probably based on file size and other characteristics)
      • notification that a consumer has actually finished a job (the LB decreases T by the estimated duration of the completed job and updates t)
      • assignment of a new job (the LB increases T by the estimated duration of the new job and updates t)
      • intermediate progress updates of estimated time remaining, sent by consumers during long jobs.
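    The [t, T]_y bookkeeping and the least-time-remaining assignment rule can be sketched as follows (class and method names are invented for the example; time units are arbitrary):

```python
# Sketch of the [t, T]_y model: for each consumer keep the epoch t at
# which the backlog estimate T was computed; the remaining work at a
# later epoch `now` is then max(T - (now - t), 0).

class BacklogModel:
    def __init__(self, consumers):
        self.state = {y: [0.0, 0.0] for y in consumers}  # y -> [t, T]

    def remaining(self, y, now):
        t, T = self.state[y]
        return max(T - (now - t), 0.0)

    def assign(self, job_estimate, now):
        # Least time remaining: pick the consumer with the smallest
        # estimated backlog, then add the new job's estimate to it.
        y = min(self.state, key=lambda c: self.remaining(c, now))
        self.state[y] = [now, self.remaining(y, now) + job_estimate]
        return y

    def finished(self, y, job_estimate, now):
        # A consumer reported completion: subtract that job's estimate.
        self.state[y] = [now, max(self.remaining(y, now) - job_estimate, 0.0)]

m = BacklogModel(["y1", "y2"])
first = m.assign(10.0, now=0.0)   # both empty: goes to y1
second = m.assign(2.0, now=0.0)   # y1 has 10s queued: goes to y2
third = m.assign(3.0, now=1.0)    # y2 has ~1s left vs y1's ~9s: y2 again
print(first, second, third)       # y1 y2 y2
```

    Job-size-based estimates would feed in through `job_estimate`; the `finished` callback and any progress updates keep the model from drifting.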

    If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example, a list of estimated job durations whose head is the job currently being executed.

    The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.