Tags: c++, multithreading, boost-thread, interlocked-increment

Manually writing a multithreaded loop - suboptimal scalability


I have written this test application: it goes through iterations 0 to 9999, and for each integer in the range it evaluates a useless but computation-intensive function; the program then outputs the sum of the function values. To run it on several threads I use InterlockedIncrement: if the incremented iteration number is less than 10000, the thread processes that iteration, otherwise it terminates.

I am wondering why it does not scale as well as I would like. With 5 threads it runs in 8 s versus 36 s with a single thread, which gives a speedup of only ~4.5. In my experiments with OpenMP (on slightly different problems) I was getting much better scalability.
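
For reference, the OpenMP experiments used a loop of roughly the following shape (a minimal sketch for comparison, not the exact code from those experiments; compile with /openmp or -fopenmp):

#include <cmath>
#include <iostream>

int main()
{
  double sum = 0.0;

  // OpenMP hands the iterations out to the worker threads and
  // combines the per-thread partial sums via the reduction clause.
  #pragma omp parallel for reduction(+:sum)
  for( int iterIndex = 0; iterIndex < 10000; ++iterIndex )
  {
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = std::sqrt( value * std::log( 1.0 + value ) );
    sum += value;
  }

  std::cout << "Sum= " << sum << "\n";
  return 0;
}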

The source code is shown below.

I am running Windows 7 on a Phenom II X6 desktop. I don't know what other parameters might be relevant.

Could you please help me understand this suboptimal scalability? Many thanks.

#include <boost/thread.hpp>
#include <boost/shared_ptr.hpp>
#include <boost/make_shared.hpp>
#include <vector>
#include <windows.h>
#include <iostream>
#include <cmath>

using namespace std;
using namespace boost;

struct sThreadData
{
  sThreadData() : iterCount(0), value( 0.0 ) {}
  unsigned iterCount;
  double value;
};

// Shared iteration counter: each thread atomically claims the next iteration index.
volatile LONG g_globalCounter;
const LONG g_maxIter = 10000;

void ThreadProc( shared_ptr<sThreadData> data )
{
  double threadValue = 0.0;
  unsigned threadCount = 0;

  while( true )
  {
    // Atomically claim the next iteration; stop once the whole range has been handed out.
    LONG iterIndex = InterlockedIncrement( &g_globalCounter );
    if( iterIndex >= g_maxIter )
      break;

    ++threadCount;

    // Useless but computation-intensive work for this iteration.
    double value = iterIndex * 0.12345777;
    for( unsigned i = 0; i < 100000; ++i )
      value = sqrt( value * log(1.0 + value) );

    threadValue += value;
  }

  data->value = threadValue;
  data->iterCount = threadCount;
}

int main()
{
  const unsigned threadCount = 1;  // set to 1 or 5 for the runs reported above

  vector< shared_ptr<sThreadData> > threadData;
  for( unsigned i = 0; i < threadCount; ++i )
    threadData.push_back( make_shared<sThreadData>() );

  g_globalCounter = 0;

  DWORD t1 = GetTickCount();
  vector< shared_ptr<thread> > threads;
  for( unsigned i = 0; i < threadCount; ++i )
    threads.push_back( make_shared<thread>( &ThreadProc, threadData[i] ) );

  double sum = 0.0;
  for( unsigned i = 0; i < threadData.size(); ++i )
  {
    threads[i]->join();
    sum += threadData[i]->value;
  }

  DWORD t2 = GetTickCount();
  cout << "T=" << static_cast<double>(t2 - t1) / 1000.0 << "s\n";

  cout << "Sum= " << sum << "\n";
  for( unsigned i = 0; i < threadData.size(); ++i )
    cout << threadData[i]->iterCount << "\n";

  return 0;
}

Edit: attaching a screenshot with sample output of this test program (1 and 5 threads).


Solution

  • It turned out that the results can be explained by the fact that my CPU supports the AMD Turbo Core technology.

    While in Turbo CORE mode, the AMD Phenom™ II X6 1090T shifts frequency speed from 3.2GHz on six cores, to 3.6GHz on three cores

    So the clock frequencies were not the same in the single-threaded and multi-threaded runs: with one thread the core runs at 3.6 GHz, while with five threads the cores drop to 3.2 GHz, so the expected speedup is roughly 5 × 3.2/3.6 ≈ 4.4, which matches the ~4.5 I measured. I was used to playing around with multithreading on CPUs that don't support Turbo Core. Below is an image that shows:

    • the AMD OverDrive utility window (a tool that allows toggling Turbo Core on/off)
    • a run with 1 thread with Turbo Core ON
    • a run with 1 thread with Turbo Core OFF
    • a run with 5 threads
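
    One way to confirm this effect without AMD OverDrive is to give every thread an identical, fixed amount of work and watch the per-thread throughput: if it drops once more than three cores are busy, the clock frequency (rather than synchronization overhead) is the limiter. Below is a minimal sketch of such a check using std::thread and <chrono> (C++11; the names Kernel and Measure are just illustrative, not part of the original program):

#include <chrono>
#include <cmath>
#include <iostream>
#include <thread>
#include <vector>

// Same busy kernel as in the question.
double Kernel( int iterIndex )
{
  double value = iterIndex * 0.12345777;
  for( unsigned i = 0; i < 100000; ++i )
    value = std::sqrt( value * std::log( 1.0 + value ) );
  return value;
}

// Run the kernel a fixed number of times on each of 'threadCount' threads
// and report the per-thread throughput.
void Measure( unsigned threadCount, unsigned itersPerThread )
{
  using clock = std::chrono::steady_clock;
  const auto t1 = clock::now();

  std::vector<std::thread> threads;
  for( unsigned t = 0; t < threadCount; ++t )
    threads.emplace_back( [=]()
    {
      volatile double sink = 0.0;  // keep the work from being optimized away
      for( unsigned i = 0; i < itersPerThread; ++i )
        sink = sink + Kernel( static_cast<int>( i ) );
    } );

  for( auto& th : threads )
    th.join();

  const double seconds = std::chrono::duration<double>( clock::now() - t1 ).count();
  std::cout << threadCount << " thread(s): "
            << itersPerThread / seconds << " iterations/s per thread\n";
}

int main()
{
  for( unsigned threads = 1; threads <= 6; ++threads )
    Measure( threads, 2000 );
  return 0;
}

    On a CPU with Turbo Core enabled, the per-thread rate is expected to fall once more than three cores are busy, which is consistent with the ~4.4× speedup estimated above.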

    Many thanks to people who tried to help.