Search code examples
javacjava-native-interfacesegmentation-fault

What can cause a Java native function (in C) to segfault upon entry?


The Project

I'm writing a Java command line interface to a C library of internal networking and network testing tools using the Java Native Interface. The C code (which I didn't write) is complex and low level, often manipulates memory at the bit level, and uses raw sockets exclusively. The application is multi-threaded from the C side (pthreads running in the background) as well as the Java side (ScheduledThreadPoolExecutors running threads that call native code). That said, the C library should be mostly stable. The Java and JNI interface code, as it turns out, is causing problems.

The Problem(s)

The application crashes with a segmentation fault upon entry into a native C function. This only happens when the program is in a specific state (i.e. successfully running a specific native function causes the next call to another specific native function to segfault). Additionally, the application crashes with a similar-looking segfault when the quit command is issued, but again, only after successfully running that same specific native function.

I'm an inexperienced C developer and an experienced Java developer -- I'm used to crashes giving me a specific reason and a specific line number. All I have to work from in this case is the hs_err_pid*.log output and the core dump. I've included what I could at the end of this question.

My Work So Far

  1. Naturally, I wanted to find the specific line of code where the crash happened. I placed a System.out.println() right before the native call on the Java side and a printf() as the first line of the native function where the program crashes being sure to use fflush(stdout) directly after. The System.out call ran and the printf call didn't. This tells me that the segfault happened upon entry into the function -- something I've never seen before.
  2. I triple checked the parameters to the function, to ensure that they wouldn't act up. However, I only pass one parameter (of type jint). The other two (JNIEnv *env, jobject j_object) are JNI constructs and out of my control.
  3. I commented out every single line in the function, leaving only a return 0; at the end. The segfault still happened. This leads me to believe that the problem is not in this function.
  4. I ran the command in different orders (effectively running the native functions different orders). The segfaults only happen when one specific native function is run before the crashing function call. This specific function appears to behave properly when it is run.
  5. I printed the value of the env pointer and the value of &j_object near the end of this other function, to ensure that I didn't somehow corrupt them. I don't know if I corrupted them, but both have non-zero values upon exiting the function.
  6. Edit 1: Typically, the same function is run in many threads (not usually concurrently, but it should be thread safe). I ran the function from the main thread without any other threads active to ensure that multithreading on the Java side wasn't causing the issue. It wasn't, and I got the same segfault.

All of this perplexes me. Why is does it still segfault if I comment out the whole function, except for the return statement? If the problem is in this other function, why doesn't it fail there? If it's a problem where the first function messes up the memory and the second function illegally accesses the corrupt memory, why doesn't if fail on the line with the illegal access, rather than on entry to the function?

If you see an internet article where someone explains a problem similar to mine, please comment it. There are so many segfault articles, and none seem to contain this specific problem. Ditto for SO questions. The problem may also be that I'm not experienced enough to apply an abstract solution to this problem.

My Question

What can cause a Java native function (in C) to segfault upon entry like this? What specific things can I look for that will help me squash this bug? How can I write code in the future that will help me avoid this problem?

Helpful Info

For the record, I can't actually post the code. If you think a description of the code would be helpful, comment and I'll edit it in.

Error Message

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00002aaaaaf6d9c3, pid=2185, tid=1086892352
#
# JRE version: 6.0_21-b06
# Java VM: Java HotSpot(TM) 64-Bit Server VM (17.0-b16 mixed mode linux-amd64 )
# Problematic frame:
# j  path.to.my.Object.native_function_name(I)I+0
#
# An error report file with more information is saved as:
# /path/to/hs_err_pid2185.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

The Important Bits of the hs_err_pid*.log File

---------------  T H R E A D  ---------------

Current thread (0x000000004fd13800):  JavaThread "pool-1-thread-1" [_thread_in_native, id=2198, stack(0x0000000040b8a000,0x0000000040c8b000)]

siginfo:si_signo=SIGSEGV: si_errno=0, si_code=128 (), si_addr=0x0000000000000000

Registers:
RAX=0x34372e302e3095e1, RBX=0x00002aaaae39dcd0, RCX=0x0000000000000000, RDX=0x0000000000000000
RSP=0x0000000040c89870, RBP=0x0000000040c898c0, RSI=0x0000000040c898e8, RDI=0x000000004fd139c8
R8 =0x000000004fb631f0, R9 =0x000000004faf5d30, R10=0x00002aaaaaf6d999, R11=0x00002b1243b39580
R12=0x00002aaaae3706d0, R13=0x00002aaaae39dcd0, R14=0x0000000040c898e8, R15=0x000000004fd13800
RIP=0x00002aaaaaf6d9c3, EFL=0x0000000000010202, CSGSFS=0x0000000000000033, ERR=0x0000000000000000
  TRAPNO=0x000000000000000d



Stack: [0x0000000040b8a000,0x0000000040c8b000],  sp=0x0000000040c89870,  free space=3fe0000000000000018k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
j  path.to.my.Object.native_function_name(I)I+0
j  path.to.my.Object$CustomThread.fire()V+18
j  path.to.my.CustomThreadSuperClass.run()V+1
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
j  java.util.concurrent.FutureTask$Sync.innerRun()V+30
j  java.util.concurrent.FutureTask.run()V+4
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Ljava/util/concurrent/ScheduledThreadPoolExecutor$ScheduledFutureTask;)V+1
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V+15
j  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V+59
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+28
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub
V  [libjvm.so+0x3e756d]
V  [libjvm.so+0x5f6f59]
V  [libjvm.so+0x3e6e39]
V  [libjvm.so+0x3e6eeb]
V  [libjvm.so+0x476387]
V  [libjvm.so+0x6ee452]
V  [libjvm.so+0x5f80df]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  path.to.my.Object.native_function_name(I)I+0
j  path.to.my.Object$CustomThread.fire()V+18
j  path.to.my.CustomThreadSuperClass.run()V+1
j  java.util.concurrent.Executors$RunnableAdapter.call()Ljava/lang/Object;+4
j  java.util.concurrent.FutureTask$Sync.innerRun()V+30
j  java.util.concurrent.FutureTask.run()V+4
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(Ljava/util/concurrent/ScheduledThreadPoolExecutor$ScheduledFutureTask;)V+1
j  java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run()V+15
j  java.util.concurrent.ThreadPoolExecutor$Worker.runTask(Ljava/lang/Runnable;)V+59
j  java.util.concurrent.ThreadPoolExecutor$Worker.run()V+28
j  java.lang.Thread.run()V+11
v  ~StubRoutines::call_stub



---------------  P R O C E S S  ---------------

Java Threads: ( => current thread )
  0x000000004fabc800 JavaThread "pool-1-thread-6" [_thread_new, id=2203, stack(0x0000000000000000,0x0000000000000000)]
  0x000000004fbcb000 JavaThread "pool-1-thread-5" [_thread_blocked, id=2202, stack(0x0000000042c13000,0x0000000042d14000)]
  0x000000004fbc9800 JavaThread "pool-1-thread-4" [_thread_blocked, id=2201, stack(0x0000000042b12000,0x0000000042c13000)]
  0x000000004fbc7800 JavaThread "pool-1-thread-3" [_thread_blocked, id=2200, stack(0x0000000042a11000,0x0000000042b12000)]
  0x000000004fc54800 JavaThread "pool-1-thread-2" [_thread_blocked, id=2199, stack(0x0000000042910000,0x0000000042a11000)]
=>0x000000004fd13800 JavaThread "pool-1-thread-1" [_thread_in_native, id=2198, stack(0x0000000040b8a000,0x0000000040c8b000)]
  0x000000004fb04800 JavaThread "Low Memory Detector" daemon [_thread_blocked, id=2194, stack(0x0000000041d0d000,0x0000000041e0e000)]
  0x000000004fb02000 JavaThread "CompilerThread1" daemon [_thread_blocked, id=2193, stack(0x0000000041c0c000,0x0000000041d0d000)]
  0x000000004fafc800 JavaThread "CompilerThread0" daemon [_thread_blocked, id=2192, stack(0x0000000040572000,0x0000000040673000)]
  0x000000004fafa800 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=2191, stack(0x0000000040471000,0x0000000040572000)]
  0x000000004fad6000 JavaThread "Finalizer" daemon [_thread_blocked, id=2190, stack(0x0000000041119000,0x000000004121a000)]
  0x000000004fad4000 JavaThread "Reference Handler" daemon [_thread_blocked, id=2189, stack(0x0000000041018000,0x0000000041119000)]
  0x000000004fa51000 JavaThread "main" [_thread_in_vm, id=2186, stack(0x00000000418cc000,0x00000000419cd000)]

Other Threads:
  0x000000004facf800 VMThread [stack: 0x0000000040f17000,0x0000000041018000] [id=2188]
  0x000000004fb0f000 WatcherThread [stack: 0x0000000041e0e000,0x0000000041f0f000] [id=2195]

VM state:not at safepoint (normal execution)

VM Mutex/Monitor currently owned by a thread: None

Heap
 PSYoungGen      total 305856K, used 31465K [0x00002aaadded0000, 0x00002aaaf3420000, 0x00002aaaf3420000)
  eden space 262208K, 12% used [0x00002aaadded0000,0x00002aaadfd8a6a8,0x00002aaaedee0000)
  from space 43648K, 0% used [0x00002aaaf0980000,0x00002aaaf0980000,0x00002aaaf3420000)
  to   space 43648K, 0% used [0x00002aaaedee0000,0x00002aaaedee0000,0x00002aaaf0980000)
 PSOldGen        total 699072K, used 0K [0x00002aaab3420000, 0x00002aaadded0000, 0x00002aaadded0000)
  object space 699072K, 0% used [0x00002aaab3420000,0x00002aaab3420000,0x00002aaadded0000)
 PSPermGen       total 21248K, used 3741K [0x00002aaaae020000, 0x00002aaaaf4e0000, 0x00002aaab3420000)
  object space 21248K, 17% used [0x00002aaaae020000,0x00002aaaae3c77c0,0x00002aaaaf4e0000)


VM Arguments:
jvm_args: -Xms1024m -Xmx1024m -XX:+UseParallelGC


---------------  S Y S T E M  ---------------

OS:Red Hat Enterprise Linux Client release 5.5 (Tikanga)

uname:Linux 2.6.18-194.8.1.el5 #1 SMP Wed Jun 23 10:52:51 EDT 2010 x86_64
libc:glibc 2.5 NPTL 2.5
rlimit: STACK 10240k, CORE 102400k, NPROC 10000, NOFILE 1024, AS infinity
load average:0.21 0.08 0.05

CPU:total 1 (1 cores per cpu, 1 threads per core) family 6 model 26 stepping 4, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt

Memory: 4k page, physical 3913532k(1537020k free), swap 1494004k(1494004k free)

vm_info: Java HotSpot(TM) 64-Bit Server VM (17.0-b16) for linux-amd64 JRE (1.6.0_21-b06), built on Jun 22 2010 01:10:00 by "java_re" with gcc 3.2.2 (SuSE Linux)

time: Tue Oct 15 15:08:13 2013
elapsed time: 13 seconds

Valgrind Output

I don't really know how to use Valgrind properly. This is what came up when running valgrind app arg1

==2184== 
==2184== HEAP SUMMARY:
==2184==     in use at exit: 16,914 bytes in 444 blocks
==2184==   total heap usage: 673 allocs, 229 frees, 32,931 bytes allocated
==2184== 
==2184== LEAK SUMMARY:
==2184==    definitely lost: 0 bytes in 0 blocks
==2184==    indirectly lost: 0 bytes in 0 blocks
==2184==      possibly lost: 0 bytes in 0 blocks
==2184==    still reachable: 16,914 bytes in 444 blocks
==2184==         suppressed: 0 bytes in 0 blocks
==2184== Rerun with --leak-check=full to see details of leaked memory
==2184== 
==2184== For counts of detected and suppressed errors, rerun with: -v
==2184== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 7 from 7)

Edit 2:

GDB Output and Backtrace

I ran it through with GDB. I made sure that the C library was compiled with the -g flag.

$ gdb `which java`
GNU gdb (GDB) Red Hat Enterprise Linux (7.0.1-23.el5)
Copyright (C) 2009 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/bin/java...(no debugging symbols found)...done.
(gdb) run -jar /opt/scts/scts.jar test.config
Starting program: /usr/bin/java -jar /opt/scts/scts.jar test.config
[Thread debugging using libthread_db enabled]
Executing new program: /usr/lib/jvm/java-1.6.0-sun-1.6.0.21.x86_64/jre/bin/java
[Thread debugging using libthread_db enabled]
[New Thread 0x4022c940 (LWP 3241)]
[New Thread 0x4032d940 (LWP 3242)]
[New Thread 0x4042e940 (LWP 3243)]
[New Thread 0x4052f940 (LWP 3244)]
[New Thread 0x40630940 (LWP 3245)]
[New Thread 0x40731940 (LWP 3246)]
[New Thread 0x40832940 (LWP 3247)]
[New Thread 0x40933940 (LWP 3248)]
[New Thread 0x40a34940 (LWP 3249)]

... my program does some work, and starts a background thread ...

[New Thread 0x41435940 (LWP 3250)]

... I type the command that seems to cause the segfault on the next command; the new threads are expected ...

[New Thread 0x41536940 (LWP 3252)]
[New Thread 0x41637940 (LWP 3253)]
[New Thread 0x41738940 (LWP 3254)]
[New Thread 0x41839940 (LWP 3255)]
[New Thread 0x4193a940 (LWP 3256)]

... I type the command that actually triggers the segfault. The new thread is expected, since the function is run in its own thread. If it did not segfault, it would have created the same number of thread as the previous command ...

[New Thread 0x41a3b940 (LWP 3257)]

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x41839940 (LWP 3255)]
0x00002aaaabcaec45 in ?? ()

... I furiously read through the gdb help, then run the backtrace ...

(gdb) bt
#0  0x00002aaaabcaec45 in ?? ()
#1  0x00002aaaf3ad7800 in ?? ()
#2  0x00002aaaf3ad81e8 in ?? ()
#3  0x0000000041838600 in ?? ()
#4  0x00002aaaeacddcd0 in ?? ()
#5  0x0000000041838668 in ?? ()
#6  0x00002aaaeace23f0 in ?? ()
#7  0x0000000000000000 in ?? ()

... Shouldn't that have symbols if I compiled with -g? I did, according to the lines from the output of make:

gcc -g -Wall -fPIC -c -I ...
gcc -g -shared -W1,soname, ...

Solution

  • Looks like I've solved the issue, which I'll outline here for the benefit of others.

    What Happened

    The cause of the segmentation fault was that I used sprintf() to assign a value to a char * pointer which had not been assigned a value. Here is the bad code:

    char* ip_to_string(uint32_t ip)
    {
        unsigned char bytes[4];
        bytes[0] = ip & 0xFF;
        bytes[1] = (ip >> 8) & 0xFF;
        bytes[2] = (ip >> 16) & 0xFF;
        bytes[3] = (ip >> 24) & 0xFF;
    
        char *ip_string;
        sprintf(ip_string, "%d.%d.%d.%d", bytes[0], bytes[1], bytes[2], bytes[3]);
        return ip_string;
    }
    

    The pointer ip_string does not have a value here, which means it points to nothing. Except, that's not entirely true. What it points to is undefined. It could point anywhere. So in assigning a value to it with sprintf(), I inadvertently overwrote a random bit of memory. I believe that the reason for the odd behaviour (though I never confirmed this) was that the undefined pointer was pointing to somewhere on the stack. This caused the computer to be confused when certain functions were called.

    One way to fix this is to allocate memory and then point the pointer to that memory, which can be accomplished with malloc(). That solution would look similar to this:

    char* ip_to_string(uint32_t ip)
    {
        unsigned char bytes[4];
        bytes[0] = ip & 0xFF;
        bytes[1] = (ip >> 8) & 0xFF;
        bytes[2] = (ip >> 16) & 0xFF;
        bytes[3] = (ip >> 24) & 0xFF;
    
        char *ip_string = malloc(16);
        sprintf(ip_string, "%d.%d.%d.%d", bytes[0], bytes[1], bytes[2], bytes[3]);
        return ip_string;
    }
    

    The problem with this is that every malloc() needs to be matched by a call to free(), or you have a memory leak. If I call free(ip_string) inside this function the returned pointer will be useless, and if I don't then I have to rely on the code that's calling this function to release the memory, which is pretty dangerous.

    As far as I can tell, the "right" solution to this is to pass an already allocated pointer to the function, such that it is the function's responsibility to fill pointed to memory. That way, calls to malloc() and free() can be made in the block of code. Much safer. Here's the new function:

    char* ip_to_string(uint32_t ip, char *ip_string)
    {
        unsigned char bytes[4];
        bytes[0] = ip & 0xFF;
        bytes[1] = (ip >> 8) & 0xFF;
        bytes[2] = (ip >> 16) & 0xFF;
        bytes[3] = (ip >> 24) & 0xFF;
    
        sprintf(ip_string, "%d.%d.%d.%d", bytes[0], bytes[1], bytes[2], bytes[3]);
        return ip_string;
    }
    

    Answers to the Questions

    What can cause a Java native function (in C) to segfault upon entry like this?

    If you assign a value to a pointer that hasn't been allocated memory, you may accidentally overwrite memory on the stack. This may not cause an immediate failure, but will probably cause problems when you call other functions later.

    What specific things can I look for that will help me squash this bug?

    Look for a segmentation fault like any other. Things like assigning a value to unallocated memory or dereferencing a null pointer. I'm not an expert on this, but I'm willing to bet that there are many web resources for this.

    How can I write code in the future that will help me avoid this problem?

    Be careful with pointers, especially when you are responsible for creating them. If you see a line of code that looks like this:

    type *variable;
    

    ... then look for a line that looks like ...

    variable = ...;
    

    ... and make sure that this line comes before writing to the pointed to memory.