C - performance of __declspec(thread) variables

I'm working on a multithreaded implementation of a library. One module of this library has some global variables that are used very often during program execution. To make access to those variables safer, I declared them with the thread-local storage (TLS) keyword __declspec(thread).

Here is the call to the library's external function, which uses the module with the global variables:

for(i = 0; i<n_cores; i++)
    hth[i] = (HANDLE)_beginthread((void(*)(void*))MT_Interface_DimenMultiCells,0,(void*)&inputSet[i]);

This way, I assume each thread gets its own copy of the variables used in the library.

When I run the program on an 8-core processor, the time required to complete the operation never drops below about 1/3 of the time needed by the single-threaded implementation.

I know that it is impossible to reach 1/8 of the time, but I thought that at least 1/6 was reachable.

The question is: are those __declspec(thread) variables the cause of such poor performance?

Solution

If you declare them as __declspec(thread) where they were previously global, then you have changed the meaning of the program, as well as its performance characteristics.

When the variable was a global, there was a single copy that every thread referred to. Now that it is thread-local, each thread has its own copy, and changes to that copy are visible only in the thread that made them.
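To make that difference concrete, here is a minimal sketch (not your code). THREAD_LOCAL, worker and run_demo are names made up for this illustration, and the non-MSVC branch uses pthreads and C11 _Thread_local only so that the sketch also builds outside Visual C++. A worker thread writes to both a plain global and a thread-local; once it has been joined, the calling thread sees the new value of the global but still sees its own untouched copy of the thread-local.

```c
#include <stddef.h>

#if defined(_MSC_VER)
#include <windows.h>
#include <process.h>
#define THREAD_LOCAL __declspec(thread)
#else
#include <pthread.h>
#define THREAD_LOCAL _Thread_local
#endif

/* Hypothetical variables, standing in for the library's globals. */
static int shared_value = 0;                /* one copy for the whole process */
static THREAD_LOCAL int private_value = 0;  /* one copy per thread */

#if defined(_MSC_VER)
static unsigned __stdcall worker(void *arg)
#else
static void *worker(void *arg)
#endif
{
    (void)arg;
    shared_value = 42;   /* writes the single process-wide copy */
    private_value = 42;  /* writes only the worker thread's copy */
    return 0;
}

/* Starts one worker, waits for it to finish, then reports what the
   calling thread sees. Returns 0 on success, -1 if the thread could
   not be created. */
int run_demo(int *seen_shared, int *seen_private)
{
#if defined(_MSC_VER)
    HANDLE h = (HANDLE)_beginthreadex(NULL, 0, worker, NULL, 0, NULL);
    if (h == NULL)
        return -1;
    WaitForSingleObject(h, INFINITE);
    CloseHandle(h);
#else
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    pthread_join(t, NULL);
#endif
    *seen_shared = shared_value;    /* the worker's write is visible here */
    *seen_private = private_value;  /* still the caller's own copy */
    return 0;
}
```

After a successful run_demo(&s, &p), s is 42 and p is 0. If the library relied on threads communicating through those globals, that communication is gone once they are thread-local.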

Assuming that you really want thread-local storage, it is true that reading and writing thread-local variables is more expensive than accessing ordinary variables. Whenever you are faced with an operation that takes a long time to perform, the best solution is to stop doing it at all. In this case there are two obvious ways to do so:

1. Pass the variable around as a parameter so that it resides on the stack. Accessing stack variables is quick.
2. If you have functions that read and write this variable a lot, then take a copy of it at the start of the function (into a local variable), work on that local variable, and then on return, write it back to the thread local.

Of these options the former is usually to be preferred. Option 2 has the big weakness that it can't easily be applied if the function calls another function that uses this variable.

Option 1 basically amounts to not using global variables (thread locals are a form of global).
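The two options could look roughly like this. It is only a sketch under assumptions: ModuleState, tls_counter and the accumulate_* functions are hypothetical stand-ins, since I don't know what your module's variables actually are. An optimizing compiler may already keep the thread-local in a register inside a simple loop like this one, so the benefit is more likely to show up in code spread across many small functions.

```c
#if defined(_MSC_VER)
#define THREAD_LOCAL __declspec(thread)
#else
#define THREAD_LOCAL _Thread_local  /* C11 spelling, for other compilers */
#endif

/* Hypothetical per-thread state, standing in for the library's globals. */
typedef struct {
    long counter;
} ModuleState;

static THREAD_LOCAL long tls_counter = 0;

/* Baseline: every iteration goes through the thread-local variable. */
void accumulate_tls(const int *data, int n)
{
    for (int i = 0; i < n; i++)
        tls_counter += data[i];
}

/* Option 1: the state is passed in explicitly. Each thread owns its own
   ModuleState (for example on its stack), so no TLS lookup is involved. */
void accumulate_param(ModuleState *state, const int *data, int n)
{
    for (int i = 0; i < n; i++)
        state->counter += data[i];
}

/* Option 2: read the thread-local once, work on a local copy, and write
   it back on return. Only safe if nothing called in between touches
   tls_counter. */
void accumulate_cached(const int *data, int n)
{
    long local = tls_counter;
    for (int i = 0; i < n; i++)
        local += data[i];
    tls_counter = local;
}

/* Accessor so callers can inspect the calling thread's copy. */
long get_tls_counter(void)
{
    return tls_counter;
}
```

With option 1, each thread function would declare its own ModuleState and hand a pointer to it down the call chain, which is what removes the need for a global in the first place.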

This all may be completely wide of the mark of course, because you have said so little about what your code is actually doing. If you want to solve a performance problem, you first have to identify where it is, and that means you need to measure.
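As a starting point for measuring, a small wall-clock timing helper is often enough to compare variants (globals versus thread-locals, or 1, 2, 4 and 8 threads). The sketch below assumes a C11 library that provides timespec_get; now_seconds, time_call and sample_work are hypothetical names, and sample_work is only a placeholder for the real workload. A profiler will tell you more, but this shows whether the time even moves when you change the variable declarations.

```c
#include <time.h>

/* Wall-clock time in seconds, using C11 timespec_get. */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Hypothetical helper: calls work(arg) `repeats` times (repeats > 0)
   and returns the mean wall-clock seconds per call. */
double time_call(void (*work)(void *), void *arg, int repeats)
{
    double start = now_seconds();
    for (int i = 0; i < repeats; i++)
        work(arg);
    return (now_seconds() - start) / repeats;
}

/* Placeholder workload: adds 0 + 1 + ... + 999 to the long that arg
   points to. Replace it with a call into the code under test. */
void sample_work(void *arg)
{
    long *total = (long *)arg;
    long sum = 0;
    for (long i = 0; i < 1000; i++)
        sum += i;
    *total += sum;
}
```

For example, time_call(sample_work, &total, 100) returns the average time of one call. Timing the same workload with the variables declared as plain globals and as __declspec(thread) shows directly how much of the slowdown they account for.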