The following code snippet crashes the JVM if a network outage occurs after the lock is acquired:
while (true) {
    // file shared over NFS
    String filename = "/home/amit/mount/lock/aLock.txt";
    RandomAccessFile file = new RandomAccessFile(filename, "rws");
    System.out.println("file opened");
    FileLock fileLock = file.getChannel().tryLock();
    if (fileLock != null) {
        System.out.println("lock acquired");
    } else {
        System.out.println("lock not acquired");
    }
    try {
        // hold the lock for 30 sec
        Thread.sleep(30000);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }
    System.out.println("closing filelock");
    if (fileLock != null) {
        fileLock.close();
    }
    System.out.println("closing file");
    file.close();
}
Observation: the JVM receives a KILL(9) signal and exits with code 137 (128 + 9).
Probably, after the network connection is re-established, something goes wrong in the file-descriptor tables. The behavior is reproducible with the flock(2) system call and with the flock(1) shell utility.
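For reference, the flock(1) reproduction is essentially the following one-liner (same NFS path as in the Java snippet; the kill only manifests when the file lives on the NFS mount and the server reboots while the lock is held):

```shell
# Hold an exclusive lock on the NFS-backed file for 30 seconds,
# mirroring what the Java snippet does with FileChannel.tryLock().
flock /home/amit/mount/lock/aLock.txt -c 'echo "lock acquired"; sleep 30'
```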
Any suggestions or workarounds?
PS: using Oracle JDK 1.7.0_25 with NFSv4
EDIT: This lock is used to determine which process is active in a distributed high-availability cluster. The exit code is 137. What do I expect? A way to detect the problem, close the file, and try to re-acquire the lock.
After the NFS server reboots, all clients holding active file locks start the lock-reclamation procedure, which lasts no longer than the so-called "grace period" (just a constant). If reclamation fails within the grace period, the NFS client (usually a kernel-space beast) sends SIGUSR1 to any process that was unable to recover its locks. That's the root of your problem.
When the lock succeeds on the server side, rpc.lockd on the client system requests another daemon, rpc.statd, to monitor the NFS server that implements the lock. If the server fails and then recovers, rpc.statd will be informed. It then tries to reestablish all active locks. If the NFS server fails and recovers, and rpc.lockd is unable to reestablish a lock, it sends a signal (SIGUSR1) to the process that requested the lock.
http://menehune.opt.wfu.edu/Kokua/More_SGI/007-2478-010/sgi_html/ch07.html
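Since SIGUSR1's default action is to terminate the process, one thing you can try is installing your own handler so the JVM survives the signal and can close the file and re-acquire the lock. A minimal sketch follows; it uses the JDK-internal, unsupported sun.misc.Signal API (present in Oracle JDK 7, but not a public contract), and Signal.raise here merely simulates what the NFS client would send:

```java
import sun.misc.Signal;
import sun.misc.SignalHandler;
import java.util.concurrent.atomic.AtomicBoolean;

public class LockSignalDemo {
    // Set from the signal-handler thread when SIGUSR1 arrives.
    static final AtomicBoolean lockLost = new AtomicBoolean(false);

    public static void main(String[] args) throws Exception {
        // Install a handler so SIGUSR1 no longer kills the JVM;
        // instead we just record that the NFS lock may have been lost.
        Signal.handle(new Signal("USR1"), new SignalHandler() {
            public void handle(Signal sig) {
                lockLost.set(true);
            }
        });

        // Simulate the NFS client sending SIGUSR1 to this process.
        Signal.raise(new Signal("USR1"));
        Thread.sleep(200); // give the handler thread time to run

        if (lockLost.get()) {
            // At this point the real code would close the FileLock and
            // the RandomAccessFile, then loop and call tryLock() again.
            System.out.println("lock lost - close file and re-acquire");
        }
    }
}
```

Note that catching the signal only tells you the lock *may* be gone; the re-acquired lock is a new lock, so another node may have won it in the meantime, which matters for your HA use case.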
You're probably wondering how to avoid this. Well, there are a couple of ways, but none is ideal: