Most efficient way to spawn n pthreads with the same parameters in C

I have 32 threads whose input parameters I know ahead of time. Nothing changes inside the function other than the memory buffer that each thread interacts with.

In pseudo C code this is my design pattern:

// declare 32 pthreads as global variables

void dispatch_32_threads() {
   for(int i=0; i < 32; i++) {
      pthread_create( &thread_id[i], NULL, thread_function, (void*) thread_params[i] );
   }
   // wait until all 32 threads are finished
   for(int j=0; j < 32; j++) {
      pthread_join( thread_id[j], NULL); 
   }
}

int main(void) {

    //init 32 pthreads here

    for(int n = 0; n<4000; n++) {
        for(int x = 0; x<100; x++) {
            for(int y = 0; y<100; y++) {
                dispatch_32_threads();
                //modify buffers here
            }
        }
    }
}

I am calling dispatch_32_threads 100*100*4000 = 40,000,000 times; thread_function and (void*) thread_params[i] never change. I think pthread_create keeps creating and destroying threads. I have 32 cores, and none of them reaches 100% utilization; it hovers around 12%. Moreover, when I reduce the number of threads to 10, all 32 cores remain at 5-7% utilization and I see no slowdown in runtime. Running fewer than 10 threads slows things down.

Running 1 thread, however, is extremely slow, so multithreading is helping. I profiled my code: thread_function is the slow part, and it is parallelizable. This leads me to believe that pthread_create keeps spawning and destroying threads on different cores, and that beyond 10 threads I lose efficiency because the work in thread_function is, in essence, cheaper than spawning more than 10 threads.

Is this assessment true? What is the best way to utilize 100% of all cores?

Solution

Thread creation is expensive. The cost depends on several factors, but it is rarely below 1000 cycles, and thread synchronisation and destruction cost about as much. If the amount of work in your thread_function is not very high, this overhead will largely dominate the computation time.
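To check this on your own machine, you could time an empty thread. The following is only a rough sketch: noop and avg_create_join_ns are hypothetical names, and the figure it returns includes the join as well as the creation, so treat it as an order of magnitude rather than a precise cost.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Hypothetical empty thread body: all measured time is overhead. */
static void *noop(void *arg) {
    return arg;
}

/* Returns the average wall-clock cost, in nanoseconds, of one
 * pthread_create + pthread_join pair, or -1.0 on failure. */
double avg_create_join_ns(int iterations) {
    struct timespec t0, t1;
    pthread_t tid;

    if (iterations <= 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    for (int i = 0; i < iterations; i++) {
        if (pthread_create(&tid, NULL, noop, NULL) != 0)
            return -1.0;
        pthread_join(tid, NULL);
    }
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;

    double elapsed_ns = (t1.tv_sec - t0.tv_sec) * 1e9
                      + (t1.tv_nsec - t0.tv_nsec);
    return elapsed_ns / iterations;
}
```

Comparing this number with the time a single call to thread_function takes should tell you whether the overhead or the work dominates.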

It is rarely a good idea to create threads in inner loops. Probably the best approach is to create threads that each process a range of iterations of the outer loop. Depending on your program and on what thread_function does, there may be dependencies between iterations, which may require some rewriting, but a solution could be:

int outer=4000;
int nthreads=32;
int perthread=outer/nthreads;  // 125; assumes outer is divisible by nthreads

// add an integer thread_id field to the thread_params struct
void *thread_func(void *arg){
    whatisrequired *thread_param = arg;
    // runs perthread iterations of the outer loop, beginning at start
    int start = thread_param->thread_id * perthread;
    for(int n = start; n<start+perthread; n++) {
        for(int x = 0; x<100; x++) {
            for(int y = 0; y<100; y++) {
                //do the work
            }
        }
    }
    return NULL;
}

int main(void){
   for(int i=0; i < nthreads; i++) {
      thread_params[i]->thread_id=i;
      pthread_create( &thread_id[i], NULL, thread_func,
              (void*) thread_params[i]);
   }
   // wait until all 32 threads are finished
   for(int j=0; j < nthreads; j++) {
      pthread_join( thread_id[j], NULL);
   }
}

With this kind of parallelization, you can consider using OpenMP. The parallel for directive makes it easy to experiment with different parallelization schemes.
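As a rough illustration of the OpenMP route, here is a sketch that assumes the iterations of the outer loop are independent of each other. do_work and run_outer_parallel are hypothetical names, and the body of do_work is only a stand-in for your real computation.

```c
/* Hypothetical stand-in for the per-(n, x, y) work. For this scheme to
 * be valid it must not depend on results from other values of n. */
static long do_work(int n, int x, int y) {
    (void)n;
    return x + y;
}

long run_outer_parallel(int outer) {
    long total = 0;

    /* schedule(static) gives each thread one contiguous chunk of n
     * values; the reduction gives every thread a private total and
     * adds them together when the loop ends. */
    #pragma omp parallel for schedule(static) reduction(+:total)
    for (int n = 0; n < outer; n++) {
        for (int x = 0; x < 100; x++) {
            for (int y = 0; y < 100; y++) {
                total += do_work(n, x, y);
            }
        }
    }
    return total;
}
```

With GCC or Clang, build with -fopenmp; without that flag the pragma is ignored and the loop simply runs serially. Changing the schedule clause (for example to dynamic) is an easy way to try other work distributions.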

If there are dependencies and such an obvious parallelization is not possible, you can create threads at program start and give them work by managing a thread pool. Managing queues is less expensive than thread creation (but atomic accesses do have a cost).
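A minimal sketch of such a pool is below, assuming the work comes in batches of independent jobs (one batch per dispatch). The names pool_t, pool_worker, run_batch and run_pool_demo are made up for illustration, the job body is a placeholder, and error handling is reduced to abort().

```c
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 32

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t work_ready; /* a new batch was posted, or shutdown */
    pthread_cond_t batch_done; /* the last job of a batch finished */
    int next_job;              /* next unclaimed job index */
    int njobs;                 /* number of jobs in the current batch */
    int pending;               /* jobs of the batch not yet finished */
    int shutdown;
    long sum;                  /* accumulated results */
} pool_t;

static void *pool_worker(void *p) {
    pool_t *pool = p;
    pthread_mutex_lock(&pool->lock);
    for (;;) {
        /* Sleep until there is an unclaimed job or we are told to exit. */
        while (!pool->shutdown && pool->next_job >= pool->njobs)
            pthread_cond_wait(&pool->work_ready, &pool->lock);
        if (pool->shutdown)
            break;
        int job = pool->next_job++;
        pthread_mutex_unlock(&pool->lock);

        long result = (long)job * job; /* stand-in for thread_function */

        pthread_mutex_lock(&pool->lock);
        pool->sum += result;
        if (--pool->pending == 0)
            pthread_cond_signal(&pool->batch_done);
    }
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

/* Post one batch of jobs and block until all of them have finished. */
static void run_batch(pool_t *pool, int njobs) {
    pthread_mutex_lock(&pool->lock);
    pool->next_job = 0;
    pool->njobs = njobs;
    pool->pending = njobs;
    pthread_cond_broadcast(&pool->work_ready);
    while (pool->pending > 0)
        pthread_cond_wait(&pool->batch_done, &pool->lock);
    pthread_mutex_unlock(&pool->lock);
}

long run_pool_demo(int nthreads, int batches, int jobs_per_batch) {
    pool_t pool;
    pthread_t tid[MAX_THREADS];

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    pthread_mutex_init(&pool.lock, NULL);
    pthread_cond_init(&pool.work_ready, NULL);
    pthread_cond_init(&pool.batch_done, NULL);
    pool.next_job = pool.njobs = pool.pending = pool.shutdown = 0;
    pool.sum = 0;

    /* Threads are created once, not once per batch. */
    for (int i = 0; i < nthreads; i++)
        if (pthread_create(&tid[i], NULL, pool_worker, &pool) != 0)
            abort();

    for (int b = 0; b < batches; b++) {
        run_batch(&pool, jobs_per_batch);
        /* modify buffers here, while the workers are idle */
    }

    pthread_mutex_lock(&pool.lock);
    pool.shutdown = 1;
    pthread_cond_broadcast(&pool.work_ready);
    pthread_mutex_unlock(&pool.lock);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    pthread_mutex_destroy(&pool.lock);
    pthread_cond_destroy(&pool.work_ready);
    pthread_cond_destroy(&pool.batch_done);
    return pool.sum;
}
```

Every job still takes the mutex twice, so this only pays off when a job does noticeably more work than a lock/unlock pair.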

Edit: Alternatively, you can
1. put all your loops in the thread function;
2. add a barrier at the start (or the end) of the inner loop body to synchronize your threads, which ensures that all threads have finished their job before any of them continues;
3. create all the threads in main and wait for completion.
Barriers are less expensive than thread creation, and the result will be identical.
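The barrier variant could look roughly like this. It is a sketch under assumptions: worker, worker_arg and run_barrier_demo are hypothetical names, the per-step work is a placeholder, and the "modify buffers" part is done by whichever thread the barrier designates as the serial one. A second barrier keeps the other threads from starting the next step before that serial part is finished.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>

#define MAX_THREADS 32

typedef struct {
    int id;
    int nthreads;
    int steps;                  /* iterations of the (flattened) loops */
    pthread_barrier_t *barrier;
    long *buffer;               /* one slot per thread */
    long *total;                /* touched only by the serial thread */
} worker_arg;

static void *worker(void *p) {
    worker_arg *a = p;
    for (int s = 0; s < a->steps; s++) {
        /* Stand-in for this thread's share of the work in step s. */
        a->buffer[a->id] = (long)(a->id + 1) * (s + 1);

        /* Wait until every thread has finished step s. */
        int rc = pthread_barrier_wait(a->barrier);
        if (rc == PTHREAD_BARRIER_SERIAL_THREAD) {
            /* Exactly one thread runs the serial "modify buffers" part. */
            for (int i = 0; i < a->nthreads; i++)
                *a->total += a->buffer[i];
        }
        /* Nobody starts step s+1 until the serial part is done. */
        pthread_barrier_wait(a->barrier);
    }
    return NULL;
}

long run_barrier_demo(int nthreads, int steps) {
    pthread_t tid[MAX_THREADS];
    worker_arg args[MAX_THREADS];
    long buffer[MAX_THREADS] = {0};
    long total = 0;
    pthread_barrier_t barrier;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    if (pthread_barrier_init(&barrier, NULL, (unsigned)nthreads) != 0)
        return -1;

    /* Threads are created once and live for the whole computation. */
    for (int i = 0; i < nthreads; i++) {
        args[i] = (worker_arg){ i, nthreads, steps, &barrier, buffer, &total };
        /* A missing thread would deadlock the barrier, so give up. */
        if (pthread_create(&tid[i], NULL, worker, &args[i]) != 0)
            abort();
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);

    pthread_barrier_destroy(&barrier);
    return total;
}
```

In your case steps would be 4000*100*100 and nthreads 32, with the placeholder line replaced by the body of thread_function.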