Tags: segmentation-fault, fortran, subroutine, abort, fortran77

dlatrs subroutine deallocates counter in calling subroutine


I've been trying to debug a segfault/SIGABRT that crashes my simulation software. I've tracked it down to a LAPACK subroutine that changes (or actually deallocates, I think) a grid counter in one of my own subroutines, which calls this LAPACK subroutine indirectly through a few other subroutines. Here is the gdb session in which I tracked the bug:

(gdb) break trifactorize
Breakpoint 1 at 0x44ad28: file /home/nspeelman/chem1d/src/solver.f, line 925.
(gdb) run
Starting program: /home/nspeelman/chem1d/bin/chem1d

(gdb) watch k
Hardware watchpoint 2: k
(gdb) c
Continuing.
Hardware watchpoint 2: k

Old value = 4
New value = 1
0x000000000044ad48 in trifactorize (lsing=@0x7fffffffdedc) at /home/nspeelman/chem1d/src/solver.f:927
927           DO k = 1, Npoint
(gdb) c
Continuing.
Hardware watchpoint 2: k

Old value = 1
New value = -214061846
dlatrs (uplo=@0x4ffea4, trans=@0x4ffe98, diag=@0x4ffe94, normin=@0x7fffffffbeab, n=@0x3043664, a=0x16867060, 
    lda=@0x4fcda4, x=0x7fffffffcf90, scale=@0x7fffffffbe98, cnorm=0x7fffffffd910, info=@0x7fffffffdc2c, _uplo=5, 
    _trans=12, _diag=4, _normin=1) at /home/nspeelman/chem1d/src/lapack/src/dlatrs.f:334
334                 DO 20 J = 1, N - 1
(gdb) c
Continuing.

Program received signal SIGABRT, Aborted.
0x00002aaaab480265 in raise () from /lib64/libc.so.6

And the backtrace:

(gdb) bt
#0  0x00002aaaab480265 in raise () from /lib64/libc.so.6
#1  0x00002aaaab481d10 in abort () from /lib64/libc.so.6
#2  0x00002aaaaad88c9e in internal_unpack (d=0x62e9, s=0x62e9)
    at ../../../gcc-4.3.4/libgfortran/runtime/in_unpack_generic.c:104
#3  0x000000000044b41c in trifactorize (lsing=@0x7fffffffdedc) at /home/nspeelman/chem1d/src/solver.f:940
#4  0x3f68b52055ec1bbd in ?? ()
#5  0x3f62d2224f8b7801 in ?? ()
#6  0x3f62e59e02e2572f in ?? ()
#7  0x3f70bab1bd0628c7 in ?? ()
#8  0x3f70cdc893daf1cf in ?? ()
#9  0x3f5a3418697ab4dc in ?? ()
#10 0x3f6b117db1893c97 in ?? ()
#11 0x3f6e0dd1b55652b4 in ?? ()
#12 0x3f6864101f64d2f0 in ?? ()
#13 0x3f7359216186a4dc in ?? ()
#14 0x3f527ee1ff8feb69 in ?? ()
#15 0x3f672c7c504a10f8 in ?? ()
#16 0x3f68f2c8e0ee6963 in ?? ()
#17 0x3f54726715d81583 in ?? ()
#18 0x3f68f2c8e0ee6963 in ?? ()
#19 0x3f6df6a7d0f5e9a5 in ?? ()
#20 0x3f5ef57fdf747822 in ?? ()
#21 0x3f56ef95d71519b0 in ?? ()
#22 0x3f6b736cc1c1feb4 in ?? ()
#23 0x3f60fb91d9400ca4 in ?? ()
#24 0x3f56ef95d71519b0 in ?? ()
#25 0x3f4b4753f00f24d5 in ?? ()
#26 0x3f5a8b9cc465a316 in ?? ()
#27 0x3f3855b423b18a6b in ?? ()
#28 0x3f568360294ec05f in ?? ()
#29 0x3f3679d6e42a4759 in ?? ()
#30 0x3f228fe18a3e97ab in ?? ()
#31 0x3f50df603cf17c50 in ?? ()
#32 0x3f5d0cea5690c8f8 in ?? ()
#33 0x3f550552679170e1 in ?? ()
#34 0x3f3d0ebaa18f7a6f in ?? ()
#35 0x3f66a1b9ef4a7dc4 in ?? ()
#36 0x3f345e9bec8a3d7a in ?? ()
#37 0x3f43854a676ff7cb in ?? ()
#38 0x3f4acbe712d1ba00 in ?? ()
#39 0x3f191497deb0cd86 in ?? ()
#40 0x3f48e220ec5df7ee in ?? ()
#41 0x3f4326498a95447b in ?? ()
#42 0x3f2e05ee4edaa6ff in ?? ()
#43 0x3f285f8e79fe6b92 in ?? ()
#44 0x3f58240bec575a1d in ?? ()
#45 0x3f47c79be94754f7 in ?? ()
#46 0x3f5cf356ce58b75a in ?? ()
#47 0x3f28c2a87c82305d in ?? ()
#48 0x3f35fca48157c9e4 in ?? ()
#49 0x3f41c924b53cdbae in ?? ()
#50 0x3f477c6c115fb520 in ?? ()
#51 0x0000000000000000 in ?? ()

I can reproduce this on Ubuntu 11.10 with gfortran 4.6.1, Ubuntu 12.04 with gfortran 4.6.3, Scientific Linux 5.6 with gfortran 4.3.4, and Microsoft Windows with gfortran 4.5.0-1. With the Intel compiler on the Linux boxes I cannot reproduce the error, but I cannot use ifort on Windows because I'm on an academic license, and I need a fix that works with gfortran because some students need a Windows version. I'm using the compiler flags -funroll-all-loops -fno-f2c -O3 for release builds and -fno-f2c -O0 -g3 for debug builds. Both sets of flags give the same problem.

Also, this bug is only reproducible when large arrays are used. I'm working with arrays of maximum size (500,Ns) and work arrays of size (Ns,Ns,500). Simulations that do not crash use Ns = 53, and the ones that crash use Ns = 153, although the declared size for Ns is 200.

Finally I'll show the code that crashes: solver.f, subroutine trifactorize:

      lSing = .FALSE.

      DO k = 1, Npoint
c---- Compute (jacB(k)-jacA(k)*jacC(k-1)). -----------------------------
         SELECT CASE ( k )
            CASE ( 1 )
            CASE DEFAULT
               CALL DGEMM( 'N', 'N', Ns, Ns, Ns, -1.0D0, jacA(:,:,k),
     >                     NsMax, jacC(:,:,k-1), NsMax, 1.0d0,
     >                     jacB(:,:,k), NsMax )
         END SELECT
c---- Factor with Gaussian elimination and estimate condition number.---
         norm = DLANGE( '1', Ns, Ns, jacB(:,:,k), NsMax, Work )
         CALL DGETRF( Ns, Ns, jacB(:,:,k), NsMax, ip(:,k), INFO )
         CALL DGECON( '1', Ns, jacB(:,:,k), NsMax, norm, Condit(k),
     >                Work, IWork, INFO )
c         WRITE(*,*)k,condit(k)
         IF ((1.0d0+condit(k)).EQ.1.0d0 .AND. iLogging.EQ.iDebug) THEN
            Write(line,10) 'Singular Jacobian Matrix'
            CALL ScreenWrite(line, iNormal)

            Write(line,11) 'Gridnumber: ', k
            CALL ScreenWrite(line, iNormal)

            lSing = .TRUE.
            RETURN
         ENDIF

c---- Compute jacC/jacB'-matrix ----------------------------------------
         CALL DGETRS( 'N', Ns, Ns, jacB(:,:,k), NsMax, ip(:,k),
     >                jacC(:,:,k), NsMax, INFO )
      ENDDO

 10   FORMAT(9X,3('-'),1X,9('-'),1X,9('-'),1X,A)
 11   FORMAT(9X,3('-'),1X,9('-'),1X,9('-'),1X,A,i4)

      CALL LogWrite('==> Decompose : Finished', iDebug)

      RETURN
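
The abort in the backtrace is raised inside libgfortran's internal_unpack, the runtime routine gfortran uses to copy an array temporary back after a call that was given an array section. One variation sometimes used with F77-style routines is to pass the first element of each slab instead of the section, so that the routine receives the address directly. This is offered only as a sketch and not as a confirmed fix for this crash. The routine colsum and the array sizes are hypothetical stand-ins for the LAPACK routines and arrays above.

```fortran
      PROGRAM seqasc
c     Hypothetical stand-in for the arrays in the question.
      INTEGER NsMax, Np
      PARAMETER ( NsMax = 4, Np = 3 )
      DOUBLE PRECISION jacB(NsMax,NsMax,Np)
      DOUBLE PRECISION colsum
      EXTERNAL colsum
      INTEGER k

      jacB = 1.0D0
      k = 2
      jacB(:,:,k) = 2.0D0
c     Array-section form, as in the question; the compiler may route
c     this through a packed temporary and copy it back afterwards.
      WRITE(*,*) colsum( 3, jacB(:,:,k), NsMax )
c     First-element form: passes the address of jacB(1,1,k) directly.
      WRITE(*,*) colsum( 3, jacB(1,1,k), NsMax )
      END

      DOUBLE PRECISION FUNCTION colsum( n, a, lda )
c     F77-style routine with an (lda,*) dummy, like LAPACK routines.
      INTEGER n, lda, i, j
      DOUBLE PRECISION a(lda,*)
      colsum = 0.0D0
      DO j = 1, n
         DO i = 1, n
            colsum = colsum + a(i,j)
         ENDDO
      ENDDO
      END
```

Both calls sum the same leading 3x3 block of slab k and should print the same value; only the way the argument is passed differs. The block is fixed-form source, so it needs to be saved with a fixed-form extension such as .f.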

dlatrs.f:

*
*           A is lower triangular.
*
            DO 20 J = 1, N - 1
               CNORM( J ) = DASUM( N-J, A( J+1, J ), 1 )
   20       CONTINUE
            CNORM( N ) = ZERO
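
In the gdb session above, the watchpoint on k fires inside this loop, which writes to CNORM, and both x and cnorm point into the stack. In the reference LAPACK DGECON, the CNORM argument for the lower-triangular solve is the slice WORK(2*N+1) of the caller's work array, and DGECON documents WORK as dimension 4*N and IWORK as dimension N. The declarations of Work and IWork in solver.f are not shown, so whether they are large enough for Ns = 153 is unknown. The following is only a sketch of a guard that could be placed before the DGECON call; wrkok and the declared sizes are hypothetical.

```fortran
      PROGRAM chkwrk
c     Hypothetical declarations: the real sizes of Work and IWork
c     in solver.f are not shown in the question.
      INTEGER NsMax
      PARAMETER ( NsMax = 200 )
      DOUBLE PRECISION Work(4*NsMax)
      INTEGER IWork(NsMax), Ns
      LOGICAL wrkok
      EXTERNAL wrkok

      Ns = 153
      IF ( wrkok( Ns, SIZE(Work), SIZE(IWork) ) ) THEN
         WRITE(*,*) 'workspace large enough for DGECON'
      ELSE
         WRITE(*,*) 'workspace too small for DGECON'
      ENDIF
      END

      LOGICAL FUNCTION wrkok( n, lwork, liwork )
c     DGECON documents WORK as dimension 4*N, IWORK as dimension N.
      INTEGER n, lwork, liwork
      wrkok = ( lwork .GE. 4*n ) .AND. ( liwork .GE. n )
      END
```

With the sizes assumed here (4*200 and 200) the check passes for Ns = 153. If the real Work array were smaller than 4*Ns, the loop above would write past its end, which would be one way for a neighbouring local such as k to be overwritten.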

I've been wondering if I'm using the wrong compiler flags, or if I've stumbled across a rare gfortran bug. Hope somebody knows how to solve this.


Solution

  • I installed gcc 4.7.0 on my Ubuntu machine and on my Windows build slave, and the problem disappeared completely on both Windows and Ubuntu. So it seems this bug is fixed in gcc 4.7.0.