I have a Java process and a C++ process communicating over a shared memory segment (using JNI for the Java side). The code runs at decent speed with one thread in Java and one in C++, but as soon as I use multiple threads in the processes the performance drops dramatically (almost 100x).
After this I tweaked a few things in the code (which I am running on my Core 2 Duo system) and found that when I pinned the Java process to one core, say core 0, and the C++ process to core 1 (using sched_setaffinity()), the performance recovered.
Why is this happening? I thought the problem might be cache contention over the shared memory segment, but then this core pinning improves performance. Also, the behavior only appears when multiple threads are used; with a single thread in each process the speeds are normal.
The "optimal" configuration when you have two threads running at full tilt is a core for each. If they aren't moved around, i.e. each thread stays on "its" core, you'll get better performance than if they are bounced back and forth between the cores. So essentially a 2+2 thread solution will require 4 cores to run optimally.
In addition, since the two threads of a process are running the same code, it is vital (in your case) that each stays on "its" core. A thread that stays put keeps its working set warm in that core's cache, which makes resuming it far less cumbersome (at the cache level) than migrating it and reloading everything onto a different core.
Then you have the issue of memory-system saturation. A "normal" single-threaded program will often use up most if not all of the available memory bandwidth; its speed is then determined by the rate at which the memory system supplies it with data. There are exceptions, such as when you're inside a division instruction during which no memory activity occurs, or in a tight loop that doesn't require data reads or writes. In most other cases the memory system will be working its butt off to shove data into the program, and a lot of the time not as fast as the program can make use of it.
A program which doesn't take this into account will run slower multi-threaded than single-threaded, because the threads collide whenever they need memory access, and this slows things down a lot. That holds for compiled languages such as C or C++. With Java there are in addition a lot of memory accesses going on behind the scenes (caused by the engine) over which the programmer has little control. So the Java engine and its workings will use up a lot of cache and memory bandwidth, which means your shared memory will be competing with the engine's needs and be in and out of the cache more or less constantly.
My two cents.