I successfully used Armadillo coupled with OpenBLAS in my master's thesis on Ubuntu 14.04 64-bit (both with Armadillo installed and without installation). The performance was very impressive; my code consisted mainly of basic matrix operations, and all of them were carried out using all available threads.
Now I am trying to use Armadillo with OpenBLAS on a Windows 7 64-bit machine in Visual Studio 2013. I found some help online and successfully added the PThread library. The code itself works, but the performance is poor. I tested three basic operations on a 1000x1000 matrix: addition, multiplication and element-wise multiplication. Of these three, only classical matrix multiplication uses all the CPU power. The other two use 25% CPU, which indicates that they run on a single thread.
I have not encountered this behavior on Ubuntu. Does anyone have any suggestions? I haven't found any report of someone having a similar issue.
Are you sure that OpenBLAS is using multiple threads on Ubuntu for addition and element-wise multiplication? Intuitively I'd expect those operations to be limited by memory bandwidth rather than by the FPU, so I'd guess multithreading wouldn't help that much.