For anyone with a similar problem, here is what it turned out to be: after some discussion with Mathworks support, the cause was a conflict between the system Boost and the Boost libraries that Matlab ships. When I compiled against the system Boost headers and linked with Matlab's (older) Boost libraries, it segfaulted. When I compiled and dynamically linked against the system Boost, but the Matlab Boost libraries were the ones that got dynamically loaded at runtime, it hung forever.
Two things work: statically linking to the system Boost, or downloading the headers for the version of Boost that Matlab ships with and compiling against those. Of course, the Mac builds of Matlab don't have version numbers in the library filenames, though the Linux and supposedly the Windows builds do. For reference, R2011b uses Boost 1.44.
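One way to catch this kind of mismatch early is to compare the Boost version that the headers report with the version Matlab ships. The following is only a minimal sketch: it assumes you pass in the BOOST_VERSION macro from <boost/version.hpp>, which encodes the version as major * 100000 + minor * 100 + patch (so 1.44.0 is 104400). The BoostVersion struct and both helper names are hypothetical, made up for this illustration.

```cpp
// Hypothetical helper type: a decoded Boost version number.
struct BoostVersion {
    int majorNum;
    int minorNum;
    int patchNum;
};

// Decode an integer in the BOOST_VERSION encoding:
// major * 100000 + minor * 100 + patch, e.g. 104400 for 1.44.0.
inline BoostVersion decodeBoostVersion(int encoded) {
    BoostVersion v;
    v.majorNum = encoded / 100000;
    v.minorNum = (encoded / 100) % 1000;
    v.patchNum = encoded % 100;
    return v;
}

// True when two encoded versions agree on major.minor.
// The patch level is deliberately ignored in this sketch.
inline bool sameBoostRelease(int headerVersion, int runtimeVersion) {
    BoostVersion h = decodeBoostVersion(headerVersion);
    BoostVersion r = decodeBoostVersion(runtimeVersion);
    return h.majorNum == r.majorNum && h.minorNum == r.minorNum;
}
```

In a mex file targeting R2011b, a call such as sameBoostRelease(BOOST_VERSION, 104400) would then tell you whether the headers you compiled against match the 1.44 libraries mentioned above.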
I have some multithreaded code that works fine when it's compiled directly, but segfaults and/or deadlocks when it's called from a Matlab mex interface. I don't know whether the different environment is revealing a flaw in my code, or what, but I can't figure it out.
I'm running this on three machine configurations (though there are several of the CentOS boxes):
Here's a pared-down version with this behavior.
#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <queue>
#include <vector>
#include <boost/thread.hpp>
#include <boost/utility.hpp>
#ifndef NO_MEX
#include "mex.h"
#endif

// Pulls job indices off a shared queue and writes one result per job.
class Worker : boost::noncopyable {
    boost::mutex &jobs_mutex;
    std::queue<size_t> &jobs;
    boost::mutex &results_mutex;
    std::vector<double> &results;

public:
    Worker(boost::mutex &jobs_mutex, std::queue<size_t> &jobs,
           boost::mutex &results_mutex, std::vector<double> &results)
        :
        jobs_mutex(jobs_mutex), jobs(jobs),
        results_mutex(results_mutex), results(results)
    {}

    void operator()() {
        size_t i;
        float r;
        while (true) {
            // get a job
            {
                boost::mutex::scoped_lock lk(jobs_mutex);
                if (jobs.size() == 0)
                    return;
                i = jobs.front();
                jobs.pop();
            }
            // do some "work"
            r = rand() / 315.612;
            // write the results
            {
                boost::mutex::scoped_lock lk(results_mutex);
                results[i] = r;
            }
        }
    }
};

std::vector<double> doWork(size_t n) {
    std::vector<double> results;
    results.resize(n);
    boost::mutex jobs_mutex, results_mutex;
    std::queue<size_t> jobs;
    for (size_t i = 0; i < n; i++)
        jobs.push(i);
    // boost::ref is needed because Worker is noncopyable
    Worker w1(jobs_mutex, jobs, results_mutex, results);
    boost::thread t1(boost::ref(w1));
    Worker w2(jobs_mutex, jobs, results_mutex, results);
    boost::thread t2(boost::ref(w2));
    t1.join();
    t2.join();
    return results;
}

#ifdef NO_MEX
int main() {
#else
void mexFunction(int nlhs, mxArray **plhs, int nrhs, const mxArray **prhs) {
#endif
    std::vector<double> results = doWork(10);
    for (size_t i = 0; i < results.size(); i++)
        printf("%g ", results[i]);
    printf("\n");
}
Note that on boost 1.48, I get the same behavior if I change the functor into a standard function and just pass boost::refs to the mutexes/data as extra arguments to boost::thread. Boost 1.33.1 doesn't support this, though.
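For readers on a newer toolchain, the "standard function plus references" variant described above can be sketched with the C++11 standard library. This is a stand-in and not the code from the question: std::thread, std::mutex, and std::ref replace their boost:: counterparts, the names worker and doWorkWithFunction are made up for illustration, and the "work" is a deterministic placeholder so the result is easy to check.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Free-function worker: pops job indices until the queue is empty.
void worker(std::mutex &jobs_mutex, std::queue<std::size_t> &jobs,
            std::mutex &results_mutex, std::vector<double> &results) {
    while (true) {
        std::size_t i;
        // get a job
        {
            std::lock_guard<std::mutex> lk(jobs_mutex);
            if (jobs.empty())
                return;
            i = jobs.front();
            jobs.pop();
        }
        // placeholder "work" (assumption: the real computation goes here)
        double r = static_cast<double>(i) * 2.0;
        // write the result
        {
            std::lock_guard<std::mutex> lk(results_mutex);
            results[i] = r;
        }
    }
}

std::vector<double> doWorkWithFunction(std::size_t n) {
    std::vector<double> results(n);
    std::mutex jobs_mutex, results_mutex;
    std::queue<std::size_t> jobs;
    for (std::size_t i = 0; i < n; i++)
        jobs.push(i);
    // std::ref is required: the thread constructor copies its arguments
    // otherwise, and mutexes are not copyable.
    std::thread t1(worker, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));
    std::thread t2(worker, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));
    t1.join();
    t2.join();
    return results;
}
```

Depending on the platform, building this may need a flag such as -pthread in addition to -std=c++11.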
When I compile it directly, it always runs fine -- I've never seen it fail in any situation:
$ g++ -o testing testing.cpp -lboost_thread-mt -DNO_MEX
$ ./testing
53.2521 895008 5.14128e+06 3.12074e+06 3.62505e+06 1.48984e+06 320100 4.61912e+06 4.62206e+06 6.35983e+06
Running from Matlab, I've seen a lot of different behaviors after making different tweaks to the code and so on, though no changes that actually make any sense to me. But here's what I've seen with the exact code above:
- boost::thread::start_thread, being called from t1's constructor.
- boost::thread::join. I'm not entirely certain, but I think the worker threads have actually completed at this point (don't see anything in info threads that's obviously them).
- pthread_mutex_lock, being called from the boost::thread::join on t1.
- __lll_lock_wait inside pthread_mutex_lock in the same place. As shown below, the worker threads have completed at this point.
I don't know how to do anything more with the segfaults, since they never occur when I have debugging symbols that can actually tell me what the null pointer is.
In the hanging-forever case, I seem to always get something like this if I'm stepping through in GDB:
99 Worker w1(jobs_mutex, jobs, results_mutex, results);
(gdb)
100 boost::thread t1(boost::ref(w1));
(gdb)
[New Thread 0x47814940 (LWP 19390)]
102 Worker w2(jobs_mutex, jobs, results_mutex, results);
(gdb)
103 boost::thread t2(boost::ref(w2));
(gdb)
[Thread 0x47814940 (LWP 19390) exited]
[New Thread 0x48215940 (LWP 19391)]
[Thread 0x48215940 (LWP 19391) exited]
105 t1.join();
That sure looks like both threads are complete before the call to t1.join(). So I tried adding a sleep(1) call in the "doing work" section between the locks; when I'm stepping through, the threads exit after the call to t1.join() and it still hangs forever:
106 t1.join();
(gdb)
[Thread 0x47814940 (LWP 20255) exited]
[Thread 0x48215940 (LWP 20256) exited]
# still hanging
If I up out to the doWork function, results is populated with the same results that the standalone version prints on this machine, so it looks like all that is going through.
I have no idea what's causing either of the segfaults or the crazy hanging-ness, or why it is that it always works outside Matlab and never inside, or why it's different with/without debugging symbols, and I have no idea how to proceed in figuring this out. Any thoughts?
At @alanxz's suggestion, I've run the standalone version of the code under valgrind's memcheck, helgrind, and DRD tools:
- valgrind --tool=helgrind ls) on OSX, complaining about an unsupported instruction.
The DRD errors are pretty inscrutable to me, and though I've read the manual and so on, I can't make any sense of them. Here's the first one, on a version of the code where I commented out the second worker/thread:
Thread 2:
Conflicting load by thread 2 at 0x0004b518 size 8
at 0x3B837: void boost::call_once<void (*)()>(boost::once_flag&, void (*)()) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x2BCD4: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x2BA62: thread_proxy (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x2D88BE: _pthread_start (in /usr/lib/system/libsystem_c.dylib)
by 0x2DBB74: thread_start (in /usr/lib/system/libsystem_c.dylib)
Allocation context: Data section of r/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib
Other segment start (thread 1)
at 0x41B4DE: __bsdthread_create (in /usr/lib/system/libsystem_kernel.dylib)
by 0x2B959: boost::thread::start_thread() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x100001B54: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:204)
by 0x100001434: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:201)
by 0x100000B50: doWork(unsigned long) (testing.cpp:66)
by 0x100000CE1: main (testing.cpp:82)
Other segment end (thread 1)
at 0x41BBCA: __psynch_cvwait (in /usr/lib/system/libsystem_kernel.dylib)
by 0x3C0C3: boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x2D28A: boost::thread::join() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
by 0x100000B61: doWork(unsigned long) (testing.cpp:72)
by 0x100000CE1: main (testing.cpp:82)
Line 66 is the construction of the thread, and 72 is the join call; there's nothing but comments in between. As far as I can tell, this is saying that there's a race between that part of the master thread and the worker thread's initialization... but I don't really understand how that's possible?
The rest of the output from DRD is here; I'm not getting anything out of it.
Are you sure that's the simplest case that segfaults and/or hangs? If the results from DRD do indicate a race condition just between thread construction and joining, it sounds like your code might not be at fault (especially since you don't actually use any mex-specific features, but just running under mex triggers the bug).
Maybe try just this version:
#include <boost/thread.hpp>

void doNothing() { return; }

void doWork() {
    boost::thread t1(doNothing);
    t1.join();
}

#ifdef NO_MEX
int main() {
#else
#include "mex.h"
void mexFunction(int nlhs, mxArray **plhs, int nrhs, const mxArray **prhs) {
#endif
    doWork();
}
This definitely shouldn't segfault or hang either under mex or compiled directly - so if it does, it's not your bug, and if it doesn't, maybe you can progressively close the distance between your version and this one to find the bug-causing addition.