Tags: c++, multithreading, parallel-processing, openmp

Is it possible to create a team of threads, and then only "use" the threads later?


So I have some OpenMP code:

for(unsigned int it = 0; it < its; ++it)
{
    #pragma omp parallel
    {
        /**
         * Run the position integrator, reset the
         * acceleration, update the acceleration, update the velocity.
         */
        #pragma omp for schedule(dynamic, blockSize)
        for(unsigned int i = 0; i < numBods; ++i)
        {
            Body* body = &bodies[i];
            body->position += (body->velocity * timestep);
            body->position += (0.5 * body->acceleration * timestep * timestep);

            /**
             * Update velocity for half-timestep, then reset the acceleration.
             */
            body->velocity += (0.5 * body->acceleration * timestep);
            body->acceleration = Vector3();
        }

        /**
         * Calculate the acceleration.
         */
        #pragma omp for schedule(dynamic, blockSize)
        for(unsigned int i = 0; i < numBods; ++i)
        {
            for(unsigned int j = i + 1; j < numBods; ++j)
            {
                Body* body = &bodies[i];
                Body* bodyJ = &bodies[j];

                /**
                 * Calculating some of the subsections of the acceleration formula.
                 */
                Vector3 rij = bodyJ->position - body->position;
                double sqrDistWithEps = rij.SqrMagnitude() + epsilon2;
                double oneOverDistCubed = 1.0 / sqrt(sqrDistWithEps * sqrDistWithEps * sqrDistWithEps);
                double scalar = oneOverDistCubed * gravConst;

                body->acceleration += bodyJ->mass * scalar * rij;
                //Newton's Third Law. Beware: bodies[j] can be written by
                //several threads at once here, which is a data race unless
                //the update is guarded (e.g. atomics or per-thread buffers).
                bodyJ->acceleration -= body->mass * scalar * rij;
            }
        }

        /**
         * Velocity for the full timestep.
         */
        #pragma omp for schedule(dynamic, blockSize)
        for(unsigned int i = 0; i < numBods; ++i)
        {
            bodies[i].velocity += (0.5 * bodies[i].acceleration * timestep);
        }
    }

    /**
     * Don't want I/O to be parallel
     */
    for(unsigned int index = 1; index < bodies.size(); ++index)
    {
        outFile << bodies[index] << std::endl;
    }
}

This works, but I can't help thinking that forking a new team of threads on every iteration is a BAD IDEA. However, the iterations must happen sequentially, so I can't parallelize the outer loop itself.

I was just wondering if there was a way to set this up to reuse the same team of threads on each iteration?


Solution

  • As far as I know (and it is the most logical approach), the thread pool is created once, and every time a thread reaches a parallel construct it requests a team of threads from that pool. So a new pool of threads is not created each time a parallel region is entered. However, if you want to be sure the same threads are reused, why not just hoist the parallel construct out of the loop and handle the sequential parts with the single pragma, something like this:

    #pragma omp parallel
    {
        for(unsigned int it = 0; it < its; ++it)
        {
           ...
    
              ...
    
            /**
            * Don't want I/O to be parallel
            */
    
            #pragma omp single
            {
                for(unsigned int index = 1; index < bodies.size(); ++index)
                {
                    outFile << bodies[index] << std::endl;
                }
            } // threads will wait in the implicit barrier of the single
        }
    }
    

    I made a quick search, and whether the first paragraph of this answer holds may depend on the OpenMP implementation you are using; I highly advise you to read the manual for the one you use.

    For example, from this source:

    OpenMP* is strictly a fork/join threading model. In some OpenMP implementations, threads are created at the start of a parallel region and destroyed at the end of the parallel region. OpenMP applications typically have several parallel regions with intervening serial regions. Creating and destroying threads for each parallel region can result in significant system overhead, especially if a parallel region is inside a loop; therefore, the Intel OpenMP implementation uses thread pools. A pool of worker threads is created at the first parallel region. These threads exist for the duration of program execution. More threads may be added automatically if requested by the program. The threads are not destroyed until the last parallel region is executed.

    Nevertheless, if you put the parallel region outside the loop, you do not have to worry about the potential overhead cited in the paragraph above.