Segmentation Fault while Calling Lapack function from Eigen

My program is written in C++ and I use Eigen library for the matrix operations inside it. There is a huge matrix product inside it, and the dimension is 50000*1000 and 1000*50000. So I tried to call the BLAS function from MKL library to improve the performance. Then there is segmentation error while calling the dgemm function.

Here is the code

        double alpha = 1, beta = 0;
    double *s1;
    MKL_INT mm1 = q, nn1 = q, kk1 = ncol1;
    s1 = (double *)malloc(q*q*sizeof(double));
    cout << 14 << endl;

    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,mm1, nn1, kk1, alpha, V.data(), mm1, A01.data(), kk1, beta, s1, mm1);

The code works fine with small dimension. I compiles the code with:

icpc lapack.cpp generators.cpp SimpleRNG.cpp example.cpp -lmkl_intel_ilp64 -lmkl_sequential -lmkl_core -lpthread -lm  -DMKL_ILP64 -o new_example.o

icpc lapack.cpp generators.cpp SimpleRNG.cpp example.cpp -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -o new_example.o

i.e: I tried both the LP64 interface and the ILP64 interface, but neither of them work, can anyone help me out here? I run the program on Linux server and there is plenty of memory.

Thank you very much!

Solution

The below discussion assumes:

sizeof(double) == 8
MKL_INT is int, and sizeof(int) == 4
sizeof(std::size_t) == 8
CHAR_BIT == 8

These should be true in a typical 64-bit system.

Something very interesting happens in this line:

s1 = (double *)malloc(q*q*sizeof(double));

If q is 50000, then q*q is 2500000000. If q is int, then this causes signed integer overflow, which causes undefined behavior. In this particular case, the compiler is likely to simply wrap around (effectively subtracting 2³²), resulting in -1794967296.

However, when you then multiply -1794967296 with sizeof(double), which is of type std::size_t, which is an unsigned integer type, fun things happen. If size_t is 64-bit, then the compiler needs to convert -1794967296 to an unsigned 64-bit number, and this conversion conceptually happens by adding 2⁶⁴ to the number, giving you 18446744071914584320. When you multiply this by sizeof(double), it overflows again, but unsigned overflow is well-defined, and returns the result modulo 2⁶⁴ for 64-bit operands, so the final result is 18446744059349813248. (See demo here for the calculations).

Thus, your original code ends up trying to allocate 18446744059349813248 bytes of memory (this is almost 16 exabytes). Ouch. Obviously the allocation will fail and return a null pointer. Since you didn't check the return value, you'll get a segmentation fault later.

When you rewrite this as

s1 = (double *)malloc(sizeof(double) * q * q);

then sizeof(double) * q is evaluated first. This multiplication will convert q to std::size_t, but since q is positive, the conversion doesn't affect its value. Thus the result is well-defined, and is a std::size_t with the value 400000. The second multiplication is similarly well-defined - q is again converted to std::size_t, and the resulting multiplication produces 20000000000, which doesn't overflow a std::size_t, so your malloc call actually requests 20GB of memory.