java java-native-interface shared-libraries nasm

Is this a bug in the JVM, or in NASM?

I think I've found a bug, but I'm not sure whether to blame the JVM, JNI, NASM, GCC, or ld.

The bug causes my compiled modding language called grug (see this for videos and explanations) to sporadically crash Minecraft (Java edition) when mods cause a SIGSEGV with infinite recursion.

My modding language just consists of a 9k line grug.c, delivering a compiler and linker for my custom programming language. Its shared object (.so) output is based on ~300 tests I wrote that diff it against the expected NASM output. So my compiler is basically a NASM compiler that takes .grug files as input, rather than .s x86-64 Assembly files. I'm explaining this to clarify why I can't just ditch NASM.

After days of carefully shaving my original program down, I've finally got a minimal reproducible example which shows that errors like FATAL ERROR in native method: Static field ID passed to JNI have a roughly 1 in 20 chance of being printed when these conditions are met:

JNI is used to call a C function that opens mage.so
mage.so was generated with ld from mage.o, where mage.o was generated with nasm from mage.s (where mage.s can just be an empty file)
After loading mage.so, JNI is used to call a C function that causes a SIGSEGV

I have been able to reproduce the errors on both my Ubuntu and Arch Linux computers.

What's strange is that generating mage.o from an empty mage.c with gcc, instead of using NASM, does not cause the errors to be printed, which is why I think NASM may be to blame.

Minimal reproducible example

Main.java:

class Main {
    private native void init();
    private native void foo();

    public static void main(String[] args) {
        new Main().run();
    }

    public void run() {
        System.loadLibrary("foo");

        init();

        long iteration = 0;
        for (int i = 0; i < 2; i++) {
            System.out.println("Iteration: " + ++iteration);
            foo();
        }
    }
}

foo.c:

#include <dlfcn.h>
#include <jni.h>
#include <pthread.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf

volatile pthread_t expected_thread;

static void segv_handler(int sig) {
    (void)sig;

    {
        char msg[] = "In segv_handler()\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
    }

    if (!pthread_equal(pthread_self(), expected_thread)) {
        char msg[] = "Unexpected thread entered handler; exiting\n";
        write(STDERR_FILENO, msg, sizeof(msg)-1);
        _exit(EXIT_FAILURE);
    }

    siglongjmp(jmp_buffer, 1);
}

JNIEXPORT void JNICALL Java_Main_init(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    fprintf(stderr, "Initializing...\n");

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler,
        .sa_flags = SA_ONSTACK, // SA_ONSTACK gives SIGSEGV its own stack
    };

    // Handle stack overflow
    // See https://stackoverflow.com/a/7342398/13279557
    static char stack[SIGSTKSZ];
    stack_t ss = {
        .ss_size = SIGSTKSZ,
        .ss_sp = stack,
    };

    if (sigaltstack(&ss, NULL) == -1) {
        perror("sigaltstack");
        exit(EXIT_FAILURE);
    }

    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }
}

void recurse() {
    recurse();
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;

    expected_thread = pthread_self();

    if (sigsetjmp(jmp_buffer, 1)) {
        fprintf(stderr, "Jumped\n");
        return;
    }

    fprintf(stderr, "Recursing...\n");

    recurse();
}

mage.s: an empty file

mage.c: an empty file

Compiling libfoo.so (you will need to replace the JDK include paths here with your own; ls /usr/lib/jvm can help you find them):

gcc foo.c -o libfoo.so -shared -fPIC -g -Wall -Wextra -Wpedantic -Werror -Wfatal-errors -Wno-infinite-recursion -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include -I/usr/lib/jvm/jdk-23.0.1-oracle-x64/include/linux

Then assemble mage.s to mage.o:

nasm mage.s -felf64

And link mage.o to mage.so:

ld mage.o -o mage.so -shared

Finally we run Main.java in an infinite loop, which should eventually print FATAL ERROR in native method: Static field ID passed to JNI:

while true; do java -Xcheck:jni -XX:+AllowUserSignalHandlers -Djava.library.path=. Main.java; done

Hitting Ctrl+Z a few times will suspend the loop, after which you can use kill %% to kill it.

My questions

What I don't get is why generating mage.o from compiling mage.c, instead of from assembling mage.s, never gets the program to print the error:

gcc mage.c -c

I can see that the output of readelf -a mage.o, objdump -D mage.o, and xxd mage.o are all significantly larger when mage.o is generated from mage.c. This is due to GCC dumping GNU-specific sections in the ELF file and such, so I guess the error only being printed when using NASM to assemble mage.s may have to do with the sections in the ELF file?
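One section-level difference that might be worth ruling out (this is my assumption, not something established above): as far as I know, gcc emits a .note.GNU-stack section into its objects while NASM only does so if the source declares one, and GNU ld uses that note to decide the flags of the PT_GNU_STACK program header in the output. The sketch below checks those flags in an ELF64 image that has already been read into memory. The function name gnu_stack_flags is hypothetical, and it assumes the file's byte order matches the host's.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Looks for a PT_GNU_STACK program header in an in-memory ELF64 image.
 * Returns 1 and stores its p_flags in *flags_out if one is present,
 * 0 if the image has none, and -1 if the image is not usable ELF64.
 * Assumes the image uses the host's byte order. */
int gnu_stack_flags(const unsigned char *image, size_t size, uint32_t *flags_out) {
    Elf64_Ehdr ehdr;
    if (size < sizeof ehdr) return -1;
    memcpy(&ehdr, image, sizeof ehdr);
    if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0) return -1;
    if (ehdr.e_ident[EI_CLASS] != ELFCLASS64) return -1;
    if (ehdr.e_phnum == 0) return 0; /* e.g. a relocatable .o has no program headers */
    if (ehdr.e_phentsize != sizeof(Elf64_Phdr)) return -1;

    for (uint64_t i = 0; i < ehdr.e_phnum; i++) {
        uint64_t off = ehdr.e_phoff + i * sizeof(Elf64_Phdr);
        if (off > size || size - off < sizeof(Elf64_Phdr)) return -1;
        Elf64_Phdr phdr;
        memcpy(&phdr, image + off, sizeof phdr);
        if (phdr.p_type == PT_GNU_STACK) {
            *flags_out = phdr.p_flags; /* PF_R | PF_W, plus PF_X if an executable stack is requested */
            return 1;
        }
    }
    return 0;
}
```

Reading both variants of mage.so into a buffer and comparing the results would show whether PF_X is set in the NASM-built one only; readelf -l mage.so prints the same program headers. Whether that difference actually explains the crash is a guess on my part.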

What I also don't understand is why the pthread_equal() check in my signal handler, which exits right away when an unexpected thread enters it, does not prevent the FATAL ERROR in native method: Static field ID passed to JNI from being printed. I figured that the error was caused by an internal JVM thread entering my handler when it was meant to enter the JVM's own SIGSEGV handler, but I guess not?

I know the program prints warnings saying it wants to be run with jsig, but as I described in this answer, using jsig is not possible when you want to override the JVM's SIGSEGV handler with your own handler in C (as far as I've been able to tell from a week of research).

It's easy to throw my hands up by blaming the odd behavior of the NASM version on me not using jsig, but it doesn't make any logical sense to me. I'm still not sure whether it's actually a NASM or JVM issue.

I'm on Ubuntu 24.04.1, and here are the versions of the programs I am calling in the MRE:

$ java --version
java 23.0.1 2024-10-15
Java(TM) SE Runtime Environment (build 23.0.1+11-39)
Java HotSpot(TM) 64-Bit Server VM (build 23.0.1+11-39, mixed mode, sharing)

$ nasm --version
NASM version 2.16.01 # 2.16.03-1 also reproduces the error

$ gcc --version
gcc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0

$ ld --version
GNU ld (GNU Binutils for Ubuntu) 2.42

Solution

I guess the JVM doesn't allocate a guard page for each thread.

I can reliably (100% chance) reproduce this error on my machine, even with a simplified version of the code.

This code may or may not compile on your machine because I didn't use extra checks like -Wall and -Werror.

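#include <dlfcn.h>
#include <jni.h>
#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

sigjmp_buf jmp_buffer; // sigsetjmp()/siglongjmp() take a sigjmp_buf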

static char *base;
static char *top;

static void segv_handler(int sig) {
    (void)sig;
    siglongjmp(jmp_buffer, 1);
}

JNIEXPORT void JNICALL Java_Main_foo(JNIEnv *env, jobject obj) {
    (void)env;
    (void)obj;
    char b;
    base = &b;

    struct sigaction sigsegv_sa = {
        .sa_handler = segv_handler
    };
    if (sigfillset(&sigsegv_sa.sa_mask) == -1) {
        perror("sigfillset");
        exit(EXIT_FAILURE);
    }

    if (sigaction(SIGSEGV, &sigsegv_sa, NULL) == -1) {
        perror("sigaction");
        exit(EXIT_FAILURE);
    }

    void *dll = dlopen("./mage.so", RTLD_NOW);
    if (!dll) {
        fprintf(stderr, "dlopen(): %s\n", dlerror());
    }

    if (sigsetjmp(jmp_buffer, 1)) {
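        // Cast the pointers so the arguments match the %lx/%ld conversions
        fprintf(stderr, "Jumped %lx %lx %ld\n", (unsigned long)base, (unsigned long)top, (long)((base - top) / 1024));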
        return;
    }

    char c;
    top = &c;
    while (1) {
        top--;
        *top = 1;
    }
}

I ran it under gdb so that it pauses when it terminates, which gives me a chance to inspect its memory layout.

The program prints Jumped 7ffff64f9447 7ffff64b0fff 289. Its memory layout, as shown in /proc/<pid>/maps, looks like this:

7ffff6491000-7ffff649d000 r--p 00000000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff649d000-7ffff64ab000 r-xp 0000c000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64ab000-7ffff64b0000 r--p 0001a000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b0000-7ffff64b1000 r--p 0001f000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b1000-7ffff64b2000 rw-p 00020000 08:10 174692                     /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so
7ffff64b2000-7ffff64b3000 rw-p 00000000 00:00 0
7ffff64b3000-7ffff64bb000 rw-s 00000000 08:10 6624                       /tmp/hsperfdata_root/4468 (deleted)
7ffff64bb000-7ffff64fb000 rwxp 00000000 00:00 0
7ffff64fb000-7ffff64fe000 r--p 00000000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff64fe000-7ffff6519000 r-xp 00003000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff6519000-7ffff651d000 r--p 0001e000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff651d000-7ffff651e000 r--p 00021000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
7ffff651e000-7ffff651f000 rw-p 00022000 08:10 1487                       /usr/lib/x86_64-linux-gnu/libgcc_s.so.1

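Note that 7ffff64b0fff, the address where SIGSEGV is raised, is already inside the mapping of /usr/lib/jvm/java-21-openjdk-amd64/lib/libjava.so, well below the 7ffff64bb000-7ffff64fb000 region that base lies in. The write only faulted once it reached a read-only (r--p) page. Before that, it had already passed through the writable mappings at 7ffff64b1000-7ffff64bb000 (libjava.so's rw-p data, an anonymous rw-p page, and the hsperfdata file), which means the earlier writes to the "stack" overwrote JVM internal state, causing unpredictable errors.

I'm still not sure why this only happens when dlopen("mage.so") is called, but I think the absence of a guard page is the root cause.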
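The address check I did by eye against the maps output can also be done in code. This is a minimal sketch, assuming maps-formatted text on a FILE* and 64-bit unsigned long; the struct mapping type and the find_mapping name are hypothetical.

```c
#include <stdio.h>
#include <string.h>

struct mapping {
    unsigned long start; /* inclusive */
    unsigned long end;   /* exclusive */
    char perms[5];       /* e.g. "rw-p" */
};

/* Scans maps-formatted text from `in` and fills `out` with the mapping
 * that contains `addr`. Returns 1 if one is found, 0 otherwise.
 * Lines that do not parse as "start-end perms ..." are skipped. */
int find_mapping(FILE *in, unsigned long addr, struct mapping *out) {
    char line[512];
    while (fgets(line, sizeof line, in)) {
        struct mapping m;
        if (sscanf(line, "%lx-%lx %4s", &m.start, &m.end, m.perms) != 3)
            continue;
        if (addr >= m.start && addr < m.end) {
            *out = m;
            return 1;
        }
    }
    return 0;
}
```

Inside Java_Main_foo this could be called with fopen("/proc/self/maps", "r") and (unsigned long)top after the jump, to print which mapping the faulting address landed in and what its permissions are.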