I would like to ask if someone would have an idea on the best(fastest) algorithm for the following scenario:
Limitations
I can't think of other limitations at the moment.
Disadvantages to this approach
There are some more details that would help to see the picture more clearly.
Now I considered the following approach:
Now would this change be practical? Like I said, each Y process has its own queue, it doesn't seem to be efficient to keep it anymore. If so, then I'm still undecided on the next bit:
How to decide which files to fetch
I've read through the knapsack problem and I think that has application if I would have the entire list of files from the beginning which I don't. Actually, I do have the list and the size of each file but I wouldn't know when each file would be ready to be taken.
I've gone through the producer-consumer problem but that centers around a fixed buffer and optimising that but in this scenario the buffer is unlimited and I don't really care if it is large or small.
The next best option would be a greedy approach where each Y process locks on the smallest file and takes it. At first it does appear to be the fastest approach and I'm currently building a simulation to verify that but a second opinion would be fantastic.
Update Just to be sure that everyone gets the big picture, I'm linking here a fast-done diagram.
The current LB idea is not good
The load balancer as you've described it is a bad idea because it's needlessly required to predict the future, which you are saying is impossible. Moreover, round-robin is a terrible scheduling strategy when jobs have varying lengths.
Just have consumers notify the LB when they're idle. When a new job arrives from a producer, it selects the first idle consumer and sends the job there. When there are no idle consumers, the producer's job is queued in the LB waiting for a free consumer to appear.
This way consumers will always be optimally busy.
You say "Having one queue to serve 100 apps (for example) would be inefficient." This is a huge leap of intuition that's probably wrong. A work queue that's only handling file names can be fast. You need it only to be 100 times faster (because you infer there are 100 consumers) than the average "very large file" handling operation. File handling is normally 10th of seconds or seconds. A queue handler based, say, on an Apache mod or Redis for two random choices, could pretty easily serve 10,000 requests per second. This is a factor of 10 away from being a bottleneck.
If you select from idle consumers on a FIFO basis, the behavior will be round-robin when all jobs are equal length.
If the LB absolutelly cannot queue work
Then let Ty(t) be the total future time needed to complete the work in the queue of consumer y at the current epoch t. The LB's goal is to make Ty(t) values equal for all y and t. This is the ideal.
To get as close as possible to the ideal, it needs an internal model to compute these Ty(t) values. When a new job arrives from a producer at epoch t, it finds consumer y with the the minimum Ty(t) value, assigns the job to this y, and adjusts the model accordingly. This is a variation of the "least time remaining" scheduling strategy, which is optimal for this situation.
The model must inevitably be an approximation. The quality of the approximation will determine its usefulness.
A standard approach (e.g. from OS scheduling), will be to maintain a pair [t, T]_y for each consumer y. T is the estimate of Ty(t) that was computed at the past epoch t. Thus at a later epoch t+d, we can estimate Ty(t+d) as max(T-t,0). The max is because for d>t, the estimated job time has expired, so the consumer should be complete.
The LB uses whatever information it can get to update the model. Examples are estimates of time a job will require (from your description probably based on file size and other characteristics), notification that the consumer has actually finished a job (LB decreases T by the esimated duration of the completed job and updates t), assignment of a new job (LB increases T by the estimated duration of the new job and updates t), and intermediate progress updates of estimated time remaining from consumers during long jobs.
If the information available to the LB is detailed, you will want to replace the total time T in the [t, T]_y pair with a more complete model of the work queued at y: for example a list of estimated job durations, where the head of the list is the one currently being executed.
The more accurate the LB model, the less likely a consumer will starve when work is available, which is what you are trying to avoid.