Search code examples
pythonlinuxnfs

Multiprocessing python on NFS gets 'No such file or directory' from os.cwd()


I've been troubleshooting a test case on an EL9 environment which behaves differently in an EL7 environment.

It's specifically:

  • On an NFS mounts.
  • Occurs on our EL9 systems (5.14.0-362.18.1.el9_3.x86_64). Does not occur on any of our older Centos 7 systems (3.10 kernel various versions).
  • In a subdirectory. (Doesn't happen in a 'top level' mount).
  • The code frequently 'breaks' with os.cwd() returning 'no such file or directory'.

Reproduced on:

  • 5.14.0-427.el9.x86_64 (Alma)
  • 5.14.0-362.24.1.el9_3.x86_64 (Alma)
  • 5.14.0-362.18.1.el9_3.x86_64 (Alma)
  • 5.14.0-362.8.1.el9_3.x86_64 (RHEL)

Example reproduction code:

#!/usr/bin/python3 -u

import os
import multiprocessing
import time

if __name__=="__main__":
    multiprocessing.set_start_method("spawn")

    count = 0
    while True:
        try:
            os.getcwd()
            pool = multiprocessing.Pool(10)
            pool.close()
            pool.terminate()
            count += 1
        except Exception as e:
            print(f"Failed after {count} iterations")
            print(e)
            break

I'm at something of a loss to understand quite what's going on here, and quite why it fails.

It seems to be connected to pool.terminate() as if you add even a short (0.05) sleep before that, the problem stops occurring (at least, wasn't reproducible in 'sensible' amounts of time, where the above fails in <10 iterations).

And interestingly the cwd stays inconsistent:

build-2[myuser]:[~/python_race]$ ./simple.py 
Failed after 2 iterations
[Errno 2] No such file or directory
build-2[myuser]:[~/python_race]$ ./simple.py 
Traceback (most recent call last):
  File "<frozen importlib._bootstrap_external>", line 1362, in _path_importer_cache
KeyError: '.'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./simple.py", line 4, in <module>
    import multiprocessing
  File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
  File "<frozen importlib._bootstrap>", line 982, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 925, in _find_spec
  File "<frozen importlib._bootstrap_external>", line 1423, in find_spec
  File "<frozen importlib._bootstrap_external>", line 1392, in _get_spec
  File "<frozen importlib._bootstrap_external>", line 1364, in _path_importer_cache
  File "<frozen importlib._bootstrap_external>", line 1340, in _path_hooks
  File "<frozen importlib._bootstrap_external>", line 1610, in path_hook_for_FileFinder
  File "<frozen importlib._bootstrap_external>", line 1486, in __init__
FileNotFoundError: [Errno 2] No such file or directory

This seems to 'recover' with a chdir, or waiting for what I assume is an NFS cache expiry interval.

We've picked this up due to some of our tests failing on the 'new version' on a more simplistic:

import multiprocessing
multiprocessing.set_start_method("spawn")

while True:
    multiprocessing.Pool(10)

But as mentioned - specific to NFS mounts in subdirs which we think might be because the struct dentry behaving a bit differently on root mounts.

But I was wondering if anyone could shed some insight into what might be going wrong here? I'm at a loss to identify whether it might be fileserver, linux kernel, something within python, or... well, somewhere else entirely.

With rpcdebug enabled we appear to get:

Apr 24 11:58:47 build-2 kernel: NFS: release(lib/libgcc_s.so.1)
Apr 24 11:58:47 build-2 kernel: NFS: release(lib/libc.so.6)
Apr 24 11:58:47 build-2 kernel: NFS: release(locale/locale-archive)
Apr 24 11:58:47 build-2 kernel: NFS reply getattr: -512
Apr 24 11:58:47 build-2 kernel: nfs_revalidate_inode: (0:53/3834468196) getattr failed, error=-512
Apr 24 11:58:47 build-2 kernel: NFS: lookup(/python_race)
Apr 24 11:58:47 build-2 kernel: NFS call  lookup /python_race
Apr 24 11:58:47 build-2 kernel: RPC:       xs_tcp_send_request(168) = 0
Apr 24 11:58:47 build-2 kernel: NFS: release(bin/python3.11)

Edit:

Have also found it does not occur if you add a pool.join() prior to the terminate()


Solution

  • This isn't really an answer-answer, as much as confirmation that it is, in fact, a kernel bug, not a python issue - it just so happens that python was an easy way to 'trigger' it.

    Logged with RedHat: https://issues.redhat.com/browse/RHEL-35853

    If anyone's hitting the same problem, we found that adding:

    pool.join()
    

    into the code prior to pool.terminate() seems to stop it occurring, but give the issue is in kernel space, we're not confident that it's anything more substantive than "just" a short delay in the code, which masks the implicit race condition.

    We've found that a sleep(0.05) likewise seems to prevent it happening, presumably for similar reasons, as there's no way userspace should be able to invalidate the directory entry caching in the first place.