My problem in short:
I have a computer with two AMD Opteron 6272 sockets and 64 GB of RAM.
I run one multithreaded program on all 32 cores and get 15% lower speed than when I run 2 programs, each on one 16-core socket.
How do I make the one-program version as fast as the two-program one?
More details:
I have a large number of tasks and want to fully load all 32 cores of the system.
So I pack the tasks into groups of 1000. Each group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal, I copy these groups 32 times and distribute the tasks between the 32 cores using ITBB's parallel_for loop.
I use pthread_setaffinity_np to ensure that the system does not make my threads jump between cores, and that all cores are used consecutively.
I use mlockall(MCL_FUTURE) to ensure that the system does not make my memory jump between sockets.
So the code looks like this:
void operator()(const blocked_range<size_t> &range) const
{
    for(unsigned int i = range.begin(); i != range.end(); ++i){
        pthread_t I = pthread_self();
        int s;
        cpu_set_t cpuset;
        pthread_t thread = I;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        mlockall(MCL_FUTURE); // lock virtual memory to stay at physical address where it was allocated
        TaskManager manager;
        for (int j = 0; j < fNTasksPerThr; j++){
            manager.SetData( &(InpData->fInput[j]) );
            manager.Run();
        }
    }
}
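The loop above stores the result of the affinity call in s but never looks at it, so a failed pin would go unnoticed. A minimal sketch of a checked pin follows, assuming Linux with glibc. The helper names first_allowed_cpu and pin_current_thread are hypothetical, and the sketch swaps in sched_setaffinity with pid 0 (which acts on the calling thread) for pthread_setaffinity_np.

```cpp
// Linux/glibc only; g++ defines _GNU_SOURCE by default, which these calls need.
#include <sched.h>
#include <cerrno>

// Hypothetical helper: lowest-numbered CPU the calling thread may run on,
// or -1 if the affinity mask cannot be read.
inline int first_allowed_cpu() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set)) return cpu;
    return -1;
}

// Hypothetical helper: pin the calling thread to exactly one CPU.
// Returns 0 on success, otherwise an errno value describing the failure.
inline int pin_current_thread(int cpu) {
    if (cpu < 0 || cpu >= CPU_SETSIZE) return EINVAL;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    // pid 0 means "the calling thread" for sched_setaffinity.
    if (sched_setaffinity(0, sizeof(set), &set) != 0) return errno;
    return 0;
}
```

Called once at the start of each worker's range, with the CPU taken from threadNumberToCpuMap[i], a non-zero return value would show that a thread was never moved to the intended core.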
Only the computing time is important to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the time measurements.
void operator()(const blocked_range<size_t> &range) const
{
    for(unsigned int i = range.begin(); i != range.end(); ++i){
        pthread_t I = pthread_self();
        int s;
        cpu_set_t cpuset;
        pthread_t thread = I;
        CPU_ZERO(&cpuset);
        CPU_SET(threadNumberToCpuMap[i], &cpuset);
        s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
        mlockall(MCL_FUTURE); // lock virtual memory to stay at physical address where it was allocated
        InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
        for(int j=0; j<fNTasksPerThr; j++){
            InpData[i].fInput[j] = InpDataPerThread.fInput[j];
        }
    }
}
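Whether each copy of the input really ended up on the socket of the thread that prepared it can be checked rather than assumed. The sketch below is one possible check, assuming Linux: it calls the move_pages system call in query mode (a null node list), which reports the NUMA node of a page without moving it. The helper name numa_node_of_address is hypothetical, and the function returns -1 whenever the kernel cannot answer (page not yet touched, no NUMA support, or the call is not permitted).

```cpp
// Linux only: queries the NUMA node that backs a given address.
#include <sys/syscall.h>
#include <unistd.h>
#include <cstdint>

// Hypothetical helper: NUMA node of the page containing addr, or -1 if unknown.
inline int numa_node_of_address(const void* addr) {
    const long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0) return -1;
    // Round the address down to the start of its page.
    const std::uintptr_t mask = ~static_cast<std::uintptr_t>(page_size - 1);
    void* pages[1] = { reinterpret_cast<void*>(
        reinterpret_cast<std::uintptr_t>(addr) & mask) };
    int status[1] = { -1 };
    // pid 0 = this process; a null node list means "report, do not move".
    const long rc = syscall(SYS_move_pages, 0L, 1UL, pages,
                            static_cast<const int*>(nullptr), status, 0L);
    if (rc != 0 || status[0] < 0) return -1;  // negative status is an errno
    return status[0];
}
```

Calling it on &InpData[i].fInput[0] after the preparation loop, from any thread, would show which node holds the first page of each thread's input.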
Now I run all of this on 32 cores and see a speed of ~1600 tasks per second.
Then I create two versions of the program and, with taskset and pthread, ensure that the first runs on the 16 cores of the first socket and the second on the second socket. I run them side by side using the & operator in the shell:
program1 & program2 &
Each of these programs achieves a speed of ~900 tasks/s. In total this is >1800 tasks/s, which is 15% more than the one-program version.
What am I missing?
I suspect the problem may be in the libraries, which I load into the memory of the master thread only. Can this be a problem? Can I copy the library data so that it is available independently on both sockets?
I would guess that it's STL/boost memory allocation that is spreading the memory for your collections, etc. across NUMA nodes, because these allocators are not NUMA-aware and your program has threads running on each node.
Custom allocators for all of the STL/boost things that you use might help (but that is likely a huge job).
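To give an idea of what such a custom allocator could look like, here is a minimal sketch. It is not a drop-in fix: the name NodeAllocator and the USE_LIBNUMA build switch are assumptions made for this example. When built with -DUSE_LIBNUMA -lnuma it requests memory from one NUMA node through libnuma's numa_alloc_onnode; otherwise it falls back to plain malloc so the code still compiles and runs. numa_alloc_onnode works in whole pages and is slower than malloc, so this shape suits a few large containers rather than many small ones.

```cpp
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <new>
#include <vector>
#ifdef USE_LIBNUMA   // assumption: opt-in switch, build with -DUSE_LIBNUMA -lnuma
#include <numa.h>
#endif

// Hypothetical allocator that ties a container's storage to one NUMA node.
template <class T>
struct NodeAllocator {
    using value_type = T;
    int node;  // NUMA node the memory should come from

    explicit NodeAllocator(int n = 0) noexcept : node(n) {}
    template <class U>
    NodeAllocator(const NodeAllocator<U>& other) noexcept : node(other.node) {}

    T* allocate(std::size_t n) {
        if (n > std::numeric_limits<std::size_t>::max() / sizeof(T))
            throw std::bad_array_new_length();
        void* p = nullptr;
#ifdef USE_LIBNUMA
        if (numa_available() != -1)
            p = numa_alloc_onnode(n * sizeof(T), node);
        else
            p = std::malloc(n * sizeof(T));
#else
        p = std::malloc(n * sizeof(T));  // fallback: no node placement at all
#endif
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t n) noexcept {
#ifdef USE_LIBNUMA
        // numa_available() is constant for the process, so this matches allocate().
        if (numa_available() != -1) { numa_free(p, n * sizeof(T)); return; }
#endif
        (void)n;
        std::free(p);
    }
};

// Allocators compare equal only when they target the same node.
template <class T, class U>
bool operator==(const NodeAllocator<T>& a, const NodeAllocator<U>& b) noexcept {
    return a.node == b.node;
}
template <class T, class U>
bool operator!=(const NodeAllocator<T>& a, const NodeAllocator<U>& b) noexcept {
    return !(a == b);
}
```

A container would then be declared as, for example, std::vector<int, NodeAllocator<int>> v(NodeAllocator<int>(1)); to keep its buffer on node 1. Every container type used by the tasks would need the same treatment, which is why this is likely a big job.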