Hang and/or segfault when using boost::thread from Matlab, but not when called directly

Update, in case anyone hits something similar: after some discussion with Mathworks support, the cause turned out to be a conflict between the system boost and the boost libraries that Matlab ships. When I compiled with the system boost headers and linked with the (older) Matlab boost libraries, it segfaulted. When I compiled and dynamically linked with the system boost, but the Matlab boost libraries were the ones actually loaded at run time, it hung forever.

Static linking to the system boost works, as does downloading the headers for the exact version of boost that Matlab ships with and compiling against those. Of course, the boost libraries in the Mac builds of Matlab don't have version numbers in their filenames, though the ones in the Linux and supposedly Windows builds do. R2011b uses boost 1.44, for reference.
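To check which boost headers a build actually picked up, the BOOST_VERSION macro from <boost/version.hpp> can be decoded into a readable version string. The following is a minimal sketch; the function name describeBoostVersion is made up for illustration, and it takes the number as a parameter so the snippet compiles without boost installed.

```cpp
#include <sstream>
#include <string>

// Decode a BOOST_VERSION-style integer (major * 100000 + minor * 100 + patch)
// into "major.minor.patch". In real use, pass the BOOST_VERSION macro from
// <boost/version.hpp>; it is a parameter here (an assumption made for this
// sketch) so that the code builds without boost.
std::string describeBoostVersion(int version) {
    std::ostringstream out;
    out << version / 100000 << '.'      // major
        << version / 100 % 1000 << '.'  // minor
        << version % 100;               // patch
    return out.str();
}
```

Printing describeBoostVersion(BOOST_VERSION) from inside the mex function and comparing it with the version Matlab ships (1.44 for R2011b) would show a header mismatch. Note that this only reports the headers used at compile time, not which shared library ends up loaded at run time.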

I have some multithreaded code that works fine when it's compiled directly, but segfaults and/or deadlocks when it's called through a Matlab mex interface. I don't know whether the different environment is revealing a flaw in my code or whether something else is going on, and I can't figure it out.

I'm running this on three machine configurations (though there are several of the CentOS boxes):

OSX 10.7, g++ 4.2, boost 1.48, Matlab R2011a (clang++ 2.1 also works for standalone, haven't tried to get mex to use clang)
ancient CentOS, g++ 4.1.2, boost 1.33.1 (debug and not debug), Matlab R2010b
ancient CentOS, g++ 4.1.2, boost 1.40 (no debug versions installed), Matlab R2010b

Here's a pared-down version with this behavior.

#include <cstdio>   // printf
#include <cstdlib>  // rand
#include <queue>
#include <vector>

#include <boost/thread.hpp>
#include <boost/utility.hpp>

#ifndef NO_MEX
#include "mex.h"
#endif

class Worker : boost::noncopyable {
    boost::mutex &jobs_mutex;
    std::queue<size_t> &jobs;

    boost::mutex &results_mutex;
    std::vector<double> &results;

    public:

    Worker(boost::mutex &jobs_mutex, std::queue<size_t> &jobs,
           boost::mutex &results_mutex, std::vector<double> &results)
        :
            jobs_mutex(jobs_mutex), jobs(jobs),
            results_mutex(results_mutex), results(results)
    {}

    void operator()() {
        size_t i;
        float r;

        while (true) {
            // get a job
            {
                boost::mutex::scoped_lock lk(jobs_mutex);
                if (jobs.size() == 0)
                    return;

                i = jobs.front();
                jobs.pop();
            }

            // do some "work"
            r = rand() / 315.612;

            // write the results
            {
                boost::mutex::scoped_lock lk(results_mutex);
                results[i] = r;
            }
        }
    }
};

std::vector<double> doWork(size_t n) {
    std::vector<double> results;
    results.resize(n);

    boost::mutex jobs_mutex, results_mutex;

    std::queue<size_t> jobs;
    for (size_t i = 0; i < n; i++)
        jobs.push(i);

    Worker w1(jobs_mutex, jobs, results_mutex, results);
    boost::thread t1(boost::ref(w1));

    Worker w2(jobs_mutex, jobs, results_mutex, results);
    boost::thread t2(boost::ref(w2));

    t1.join();
    t2.join();

    return results;
}

#ifdef NO_MEX
int main() {
#else
void mexFunction(int nlhs, mxArray **plhs, int nrhs, const mxArray **prhs) {
#endif
    std::vector<double> results = doWork(10);
    for (size_t i = 0; i < results.size(); i++)
        printf("%g ", results[i]);
    printf("\n");
}

Note that on boost 1.48, I get the same behavior if I change the functor into a standard function and just pass boost::refs to the mutexes/data as extra arguments to boost::thread. Boost 1.33.1 doesn't support this, though.
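For readers unfamiliar with that variant, here is a rough sketch of the plain-function form. It uses C++11's std::thread and std::ref as stand-ins for boost::thread and boost::ref so that it builds without boost; the names work and doWorkWithFunction are made up for illustration, and a deterministic placeholder replaces the rand() call.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Plain-function worker: the shared state arrives as reference parameters
// instead of being stored as members of a functor.
void work(std::mutex &jobs_mutex, std::queue<std::size_t> &jobs,
          std::mutex &results_mutex, std::vector<double> &results) {
    while (true) {
        std::size_t i;
        {
            // get a job
            std::lock_guard<std::mutex> lk(jobs_mutex);
            if (jobs.empty())
                return;
            i = jobs.front();
            jobs.pop();
        }

        double r = i * 2.0;  // placeholder "work" (hypothetical, deterministic)

        {
            // write the result
            std::lock_guard<std::mutex> lk(results_mutex);
            results[i] = r;
        }
    }
}

std::vector<double> doWorkWithFunction(std::size_t n) {
    std::vector<double> results(n);
    std::mutex jobs_mutex, results_mutex;

    std::queue<std::size_t> jobs;
    for (std::size_t i = 0; i < n; i++)
        jobs.push(i);

    // std::ref passes the arguments by reference; without it the thread
    // constructor would try to copy them, and std::mutex is not copyable.
    std::thread t1(work, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));
    std::thread t2(work, std::ref(jobs_mutex), std::ref(jobs),
                   std::ref(results_mutex), std::ref(results));

    t1.join();
    t2.join();
    return results;
}
```

The locking structure is the same as in the functor version; only the way the shared state reaches the worker changes.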

When I compile it directly, it always runs fine -- I've never seen it fail in any situation:

$ g++ -o testing testing.cpp -lboost_thread-mt -DNO_MEX
$ ./testing
53.2521 895008 5.14128e+06 3.12074e+06 3.62505e+06 1.48984e+06 320100 4.61912e+06 4.62206e+06 6.35983e+06

When running from Matlab, I've seen many different behaviors after various tweaks to the code, though none of the changes make any sense to me. Here's what I see with the exact code above:

On OSX / boost 1.48:
- If it's linked to a release-variant boost, I get a segfault trying to access a near-0 address inside of boost::thread::start_thread, being called from t1's constructor.
- If it's linked to a debug-variant boost, it hangs forever in the first boost::thread::join. I'm not entirely certain, but I think the worker threads have actually completed at this point (don't see anything in info threads that's obviously them).
On CentOS / boost 1.33.1 and 1.40:
- With release boost, I get a segfault in pthread_mutex_lock, being called from the boost::thread::join on t1.
- With debugging boost, it hangs forever in __lll_lock_wait inside pthread_mutex_lock in the same place. As shown below, the worker threads have completed at this point.

I don't know how to do anything more with the segfaults, since they never occur when I have debugging symbols that can actually tell me what the null pointer is.

In the hanging-forever case, I seem to always get something like this if I'm stepping through in GDB:

99      Worker w1(jobs_mutex, jobs, results_mutex, results);
(gdb) 
100     boost::thread t1(boost::ref(w1));
(gdb) 
[New Thread 0x47814940 (LWP 19390)]
102     Worker w2(jobs_mutex, jobs, results_mutex, results);
(gdb) 
103     boost::thread t2(boost::ref(w2));
(gdb) 
[Thread 0x47814940 (LWP 19390) exited]
[New Thread 0x48215940 (LWP 19391)]
[Thread 0x48215940 (LWP 19391) exited]
105     t1.join();

That sure looks like both threads are complete before the call to t1.join(). So I tried adding a sleep(1) call in the "doing work" section between the locks; when I'm stepping through, the threads exit after the call to t1.join() and it still hangs forever:

106     t1.join();
(gdb)
[Thread 0x47814940 (LWP 20255) exited]
[Thread 0x48215940 (LWP 20256) exited]
# still hanging

If I move up the stack (gdb's up command) to the doWork frame, results is populated with the same values that the standalone version prints on this machine, so all of the work does seem to be getting done.

I have no idea what's causing either of the segfaults or the crazy hanging-ness, or why it is that it always works outside Matlab and never inside, or why it's different with/without debugging symbols, and I have no idea how to proceed in figuring this out. Any thoughts?

At @alanxz's suggestion, I've run the standalone version of the code under valgrind's memcheck, helgrind, and DRD tools:

On CentOS using valgrind 3.5, none of the tools give any non-suppressed errors.
On OSX using valgrind 3.7:
- Memcheck doesn't give any non-suppressed errors.
- Helgrind crashes for me when run on any binary (including e.g. valgrind --tool=helgrind ls) on OSX, complaining about an unsupported instruction.
- DRD gives over a hundred errors.

The DRD errors are pretty inscrutable to me, and though I've read the manual and so on, I can't make any sense of them. Here's the first one, on a version of the code where I commented out the second worker/thread:

Thread 2:
Conflicting load by thread 2 at 0x0004b518 size 8
   at 0x3B837: void boost::call_once<void (*)()>(boost::once_flag&, void (*)()) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x2BCD4: boost::detail::set_current_thread_data(boost::detail::thread_data_base*) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x2BA62: thread_proxy (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x2D88BE: _pthread_start (in /usr/lib/system/libsystem_c.dylib)
   by 0x2DBB74: thread_start (in /usr/lib/system/libsystem_c.dylib)
Allocation context: Data section of r/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib
Other segment start (thread 1)
   at 0x41B4DE: __bsdthread_create (in /usr/lib/system/libsystem_kernel.dylib)
   by 0x2B959: boost::thread::start_thread() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x100001B54: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:204)
   by 0x100001434: boost::thread::thread<boost::reference_wrapper<Worker> >(boost::reference_wrapper<Worker>, boost::disable_if<boost::is_convertible<boost::reference_wrapper<Worker>&, boost::detail::thread_move_t<boost::reference_wrapper<Worker> > >, boost::thread::dummy*>::type) (thread.hpp:201)
   by 0x100000B50: doWork(unsigned long) (testing.cpp:66)
   by 0x100000CE1: main (testing.cpp:82)
Other segment end (thread 1)
   at 0x41BBCA: __psynch_cvwait (in /usr/lib/system/libsystem_kernel.dylib)
   by 0x3C0C3: boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x2D28A: boost::thread::join() (in /usr/local/boost/boost_1_48_0/stage/lib/libboost_thread-mt-d.dylib)
   by 0x100000B61: doWork(unsigned long) (testing.cpp:72)
   by 0x100000CE1: main (testing.cpp:82)

Line 66 is the construction of the thread, and 72 is the join call; there's nothing but comments in between. As far as I can tell, this is saying that there's a race between that part of the master thread and the worker thread's initialization...but I don't really understand how that's possible?

The rest of the output from DRD is here; I'm not getting anything out of it.

Solution

Are you sure that's the simplest case that segfaults and/or hangs? If the results from DRD do indicate a race condition just between thread construction and joining, it sounds like your code might not be at fault (especially since you don't actually use any mex-specific features, but just running under mex triggers the bug).

Maybe try just this version:

#include <boost/thread.hpp>

void doNothing() { return; }

void doWork() {
    boost::thread t1(doNothing);
    t1.join();
}

#ifdef NO_MEX
int main() {
#else
#include "mex.h"
void mexFunction(int nlhs, mxArray **plhs, int nrhs, const mxArray **prhs) {
#endif
    doWork();
}

This definitely shouldn't segfault or hang either under mex or compiled directly - so if it does, it's not your bug, and if it doesn't, maybe you can progressively close the distance between your version and this one to find the bug-causing addition.