
JVM crash because of lock on nfs file after network outage


The following code snippet causes a JVM crash if a network outage occurs after the lock is acquired:

    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    while (true) {

       //file shared over NFS
       String filename = "/home/amit/mount/lock/aLock.txt";
       RandomAccessFile file = new RandomAccessFile(filename, "rws");
       System.out.println("file opened");
       FileLock fileLock = file.getChannel().tryLock();
       if (fileLock != null) {
          System.out.println("lock acquired");
       } else {
          System.out.println("lock not acquired");
       }

       try {
          //hold the lock for 30 sec
          Thread.sleep(30000);
       } catch (InterruptedException e) {
          e.printStackTrace();
       }
       if (fileLock != null) {
          //guard against NPE when tryLock() returned null
          System.out.println("closing filelock");
          fileLock.close();
       }
       System.out.println("closing file");
       file.close();
    }

Observation: the JVM receives the KILL(9) signal and exits with exit code 137 (128 + 9).

Probably something goes wrong in the file-descriptor tables after the network connection is re-established. The behavior is also reproducible with the flock(2) system call and the flock(1) shell utility.

Any suggestions/workarounds?

PS: using Oracle JDK 1.7.0_25 with NFSv4

EDIT: This lock is used to identify which process is active in a distributed high-availability cluster. The exit code is 137. What I expect: a way to detect the problem, close the file, and try to re-acquire the lock.


Solution

  • After the NFS server reboots, all clients that hold active file locks start a lock-reclamation procedure that lasts no longer than the so-called "grace period" (just a constant). If the reclamation procedure fails within the grace period, the NFS client (usually a kernel-space beast) sends SIGUSR1 to any process that was unable to recover its locks. That is the root of your problem.

    When the lock succeeds on the server side, rpc.lockd on the client system requests another daemon, rpc.statd, to monitor the NFS server that implements the lock. If the server fails and then recovers, rpc.statd will be informed. It then tries to reestablish all active locks. If the NFS server fails and recovers, and rpc.lockd is unable to reestablish a lock, it sends a signal (SIGUSR1) to the process that requested the lock.

    http://menehune.opt.wfu.edu/Kokua/More_SGI/007-2478-010/sgi_html/ch07.html

    You're probably wondering how to avoid this. Well, there are a couple of ways, but none is ideal:

    1. Increase the grace period. AFAIR, on Linux it can be changed via /proc/fs/nfsd/nfsv4leasetime.
    2. Install a SIGUSR1 handler in your code and do something smart there. For instance, the handler could set a flag denoting that lock recovery has failed; when the flag is set, your program can wait for the NFS server to become ready again (for as long as it needs) and then try to recover the locks itself. Not very fruitful...
    3. Do not use NFS locking ever again. If possible, switch to ZooKeeper, as was suggested earlier.
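    For what it's worth, option 2 can be sketched roughly as below. Everything here is an assumption layered on the question's snippet: `NfsLockGuard` and `installSigUsr1Handler` are made-up names, `sun.misc.Signal` is an unsupported JDK-internal API (present in Oracle/OpenJDK, including 1.7, but not part of the Java specification), and the sketch only helps if the process actually receives SIGUSR1; a SIGKILL (which exit code 137 suggests) cannot be caught by any handler.

    ```java
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;
    import java.util.concurrent.atomic.AtomicBoolean;

    import sun.misc.Signal;
    import sun.misc.SignalHandler;

    public class NfsLockGuard {

        // Set by the signal handler when the NFS client reports failed lock recovery.
        static final AtomicBoolean lockRecoveryFailed = new AtomicBoolean(false);

        // sun.misc.Signal is an unsupported JDK-internal API; it works on
        // Oracle/OpenJDK but is not guaranteed by the Java specification.
        static void installSigUsr1Handler() {
            Signal.handle(new Signal("USR1"), new SignalHandler() {
                @Override
                public void handle(Signal sig) {
                    lockRecoveryFailed.set(true);
                }
            });
        }

        public static void main(String[] args) throws Exception {
            installSigUsr1Handler();
            String filename = "/home/amit/mount/lock/aLock.txt"; // path from the question
            while (true) {
                // FileLock is AutoCloseable since Java 7, so try-with-resources works;
                // a null resource (tryLock() failed) is simply skipped on close.
                try (RandomAccessFile file = new RandomAccessFile(filename, "rws");
                     FileLock lock = file.getChannel().tryLock()) {
                    if (lock == null) {
                        System.out.println("lock not acquired, retrying");
                    } else {
                        System.out.println("lock acquired");
                        while (!lockRecoveryFailed.get()) {
                            Thread.sleep(1000); // do the cluster-active work here
                        }
                        System.out.println("lock recovery failed, releasing and re-acquiring");
                        lockRecoveryFailed.set(false);
                    }
                }
                Thread.sleep(5000); // back off before the next attempt
            }
        }
    }
    ```

    The design choice is deliberately crude: instead of trying to recover the lock in the handler itself (signal handlers should do as little as possible), the handler only flips a flag, and the main loop tears everything down and goes back through the normal acquire path.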