I have a large set of jobs to run (thousands); each one takes between 30 minutes and a few hours on a single CPU, and the memory requirements are small (a few KB each). I'm working on a small Linux cluster with a few dozen CPUs. So far, I've been starting them a few at a time, trying to manually keep the cluster busy.
My question is: what happens if I submit hundreds or thousands at once -- far more than the number of CPUs? It's clear that each job will take longer to run individually, but I am wondering about the overall efficiency of this method vs. having exactly one job per CPU at a time. I could also write a more complicated method to monitor the progress and keep each CPU occupied with exactly one job (e.g. using multiprocessing in Python), but this would take up costly programmer time, and I'm wondering whether the end result would really be any faster.
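For concreteness, here is roughly the kind of thing I have in mind (a minimal sketch; `./run_job` and the job list are placeholders for my actual jobs):

```python
import os
import subprocess
from multiprocessing import Pool

# Placeholder job list -- in reality these would be my actual commands.
job_commands = [["./run_job", str(i)] for i in range(1000)]

def run_job(cmd):
    """Run one job as an external process and return its exit code."""
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    # One worker per CPU, so exactly one job runs per core at a time.
    with Pool(processes=os.cpu_count()) as pool:
        results = pool.map(run_job, job_commands)
```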
Speed-wise, you're unlikely to get a performance boost from spawning more threads than there are hardware threads available, unless your threads spend a lot of time sleeping (in which case sleeping gives your other threads an opportunity to execute). Note that thread sleeps can be implicit and hidden in I/O-bound processes and when contending for a lock.
It really depends on whether your threads spend most of their time waiting for something (e.g., more data to come from a server, users to do something, a file to update, access to a locked resource) or just going as fast as they can in parallel. In the latter case, using more threads than are physically available will tend to slow you down. The only way having more threads than cores can ever help throughput is when those threads spend time sleeping, yielding opportunities for other threads to do more work while they sleep.
However, it might make things easier for you to just spawn all these tasks and let the operating system deal with the scheduling.
With vastly more threads, you could potentially slow things down, even in terms of throughput. It depends somewhat on how your scheduling and thread pools work and on whether those threads spend time sleeping, but a thread is not necessarily a cheap thing to construct, and context switches among that many threads can become more expensive than your own scheduling process, which knows far more about exactly what you want to do and when it's appropriate than the operating system, which just sees a boatload of threads that need to be executed.
There's a reason why efficient libraries like Intel's Threading Building Blocks match the number of threads in the pool to the physical hardware (no more, no less). It tends to be the most efficient route, but it's also the most awkward to implement given the need for manual scheduling, work stealing, etc. So sometimes it can be convenient to just spawn a boatload of threads at once, but you typically don't do that as an optimization unless you're I/O-bound, as pointed out in the other answer, and your threads spend most of their time sleeping and waiting for input.
If you have needs like this, the easiest way to get the most out of the hardware is to find a good parallel processing library (e.g., PPL, TBB, OpenMP). Then you just write a parallel loop and let the library figure out how to deal with the threads most efficiently and how to balance the load between them. In those cases, you focus on what the tasks should do, not necessarily when they execute.
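Since you mentioned Python, a rough equivalent there is `concurrent.futures`, which sizes its process pool to the hardware by default and handles the scheduling for you (a minimal sketch; `process_item` and `inputs` are hypothetical stand-ins for your actual work):

```python
from concurrent.futures import ProcessPoolExecutor

def process_item(x):
    """Stand-in for one unit of CPU-bound work."""
    return x * x

if __name__ == "__main__":
    inputs = range(1000)
    # By default the pool size matches os.cpu_count(), mirroring the
    # "one worker per hardware thread" policy libraries like TBB use.
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(process_item, inputs))
```

The point is the same as with TBB: you express the loop, and the library keeps exactly as many workers busy as the hardware can actually run.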