
Python multiprocessing + scipy: excessive filesystem 'stat' and 'open' attempts


I am observing some extremely odd behaviour in Python. Consider the following code:

from multiprocessing import Process  
import scipy

def test():
    pass

for i in range(1000):
    p1 = Process(target=test)
    p1.start()
    p1.join()
    print i

When I run strace -f on this I get the following segment from the loop:

clone(Process 19706 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b23afde1970) = 19706
[pid 19706] set_robust_list(0x2b23afde1980, 0x18) = 0
[pid 18673] wait4(19706, Process 18673 suspended
 <unfinished ...>
[pid 19706] stat("/apps/python/2.7.1/lib/python2.7/multiprocessing/random", 0x7fff041fc150) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/randommodule.so", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.py", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/apps/python/2.7.1/lib/python2.7/multiprocessing/random.pyc", O_RDONLY) = -1 ENOENT (No such file or directory)
[pid 19706] open("/dev/urandom", O_RDONLY) = 3
[pid 19706] read(3, "\3\204g\362\260\324:]\337F0n\n\377\317\343", 16) = 16
[pid 19706] close(3)                    = 0
[pid 19706] open("/dev/null", O_RDONLY) = 3
[pid 19706] fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
[pid 19706] exit_group(0)               = ?
Process 18673 resumed
Process 19706 detached

What's up with all that junk about searching around the filesystem for 'random'? I really want to avoid this: I am running a lot of processes with this structure in parallel on a cluster, looping quite fast, and according to the cluster admins this kind of filesystem activity is clogging up the filesystem metadata server.

If I remove the "import scipy" statement then this problem goes away:

clone(Process 23081 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b42ec15e970) = 23081
[pid 23081] set_robust_list(0x2b42ec15e980, 0x18) = 0
[pid 22052] wait4(23081, Process 22052 suspended
 <unfinished ...>
[pid 23081] open("/dev/null", O_RDONLY) = 3
[pid 23081] fstat(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
[pid 23081] exit_group(0)               = ?
Process 22052 resumed
Process 23081 detached

But I need scipy in my real code, so I can't just get rid of it. Or maybe I can, but that would be a pain.

Does anyone have any idea why I am seeing this behaviour? In case it is a quirk of some version of something, I am running the following:

python: 2.7.1, multiprocessing: 0.70a1, scipy: 0.9.0

Actually, since I just realised it may be system dependent, I ran the same code on my laptop and had no problem (i.e. output equivalent to the second case). On the laptop I am running

python: 2.6.5, multiprocessing: 0.70a1, scipy: 0.10.0

Perhaps it is a problem or bug in the earlier version of scipy that has been fixed? My searches for anything like this have turned up nothing. Even if it IS the problem, it is not so easy to change versions of scipy on the cluster, although I can probably get the cluster admins to build the newer version if needed.

Is this likely to be the problem?


Solution

  • This is not because of Windows or the __main__ module. Nor is this how Python likes doing business. Your strace paths show that the interpreter in question is Python 2.7.1, and the explanation lies in how that interpreter resolves an import statement that appears inside a package.

    You are entirely correct that the issue stems from the random-module initialization step in the multiprocessing.forking module — which is designed to prevent your process, when it forks to produce n workers, from creating workers that all step forward through exactly the same series of pseudo-random numbers (which could compromise security if, for example, they were all negotiating SSL connections using those numbers):

            if 'random' in sys.modules:
                import random
                random.seed()
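
To make the motivation concrete, here is a minimal sketch that imitates what a fork does to the generator, without actually forking. The names parent, child_a and child_b and the seed value 12345 are made up for illustration. Calling seed() with no argument draws fresh entropy from the operating system, which is most likely what the 16-byte read from /dev/urandom in the strace output corresponds to.

```python
import random

# Parent process state: a generator that has already been seeded.
parent = random.Random(12345)

# Imitate fork(): each child starts with an exact copy of the
# parent's generator state.
child_a = random.Random()
child_a.setstate(parent.getstate())
child_b = random.Random()
child_b.setstate(parent.getstate())

# Without reseeding, both children draw the same sequence.
same_a = [child_a.random() for _ in range(3)]
same_b = [child_b.random() for _ in range(3)]
print(same_a == same_b)

# Reseeding from OS entropy (seed() with no argument) makes them diverge.
child_a.seed()
child_b.seed()
fresh_a = [child_a.random() for _ in range(3)]
fresh_b = [child_b.random() for _ in range(3)]
print(fresh_a == fresh_b)
```

The first comparison holds because the two copies share one state; the second should fail because each copy now has its own seed.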
    

    But the key here is to realize that the above import statement ought to be a no-op from a system-call point of view, because if a module name is already present as a key in the sys.modules dictionary then import simply returns the value that it finds there without trying to go load anything from the filesystem:

    >>> import sys
    >>> sys.modules['fake'] = 'Not even a module'
    >>> import fake
    >>> fake
    'Not even a module'
    

    The if statement quoted above, therefore, is specifically trying to prevent the expense of an extra import in the case that the random module has not even been loaded. When you do the experiment without scipy loaded up, the if statement body never even fires.
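
If you want to check which of your own imports drags random into sys.modules, something along these lines should do it. This is only a sketch: import_loads_random is a hypothetical helper, and it starts a fresh interpreter so that modules already loaded in your current session do not skew the answer. The example probes tempfile from the standard library; on the cluster you would pass "scipy" instead.

```python
import subprocess
import sys

def import_loads_random(module_name):
    """Return True if importing module_name in a fresh interpreter
    leaves 'random' in sys.modules (hypothetical helper)."""
    probe = (
        "import sys\n"
        "import {0}\n"
        "print('random' in sys.modules)\n"
    ).format(module_name)
    out = subprocess.check_output([sys.executable, "-c", probe])
    # Use the last line in case the imported module prints something itself.
    return out.decode("ascii").strip().splitlines()[-1] == "True"

if __name__ == "__main__":
    print(import_loads_random("tempfile"))
```

A True result for a module means that every child forked after that import will enter the if statement body quoted above.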

    So what is the problem?

    The problem is that Python 2 lets you mean two different things by saying import foo in a module that lives inside of a package: you might be attempting a relative import of the_package.foo, or you might be attempting an import of the top-level package foo. See PEP 328 for the details on why this ambiguous and expensive behavior can be switched off per module with from __future__ import absolute_import and was removed entirely in Python 3:

    http://legacy.python.org/dev/peps/pep-0328/

    With this background, you can review your strace output and notice something that no one has yet mentioned in the answers here: the stat() and open() system calls listed are not trying to import the module random but the non-existent module named multiprocessing.random!

    This is the crucial reason that an additional import is being attempted even though random is already listed in sys.modules: before Python 2 is allowed to fall back to the assumption that the import statement is really aiming to import the top-level random, it has to eliminate the possibility that it is instead attempting a relative import of multiprocessing.random, since the import statement appears in the code of the multiprocessing.forking sub-module.

    To spare you those extra system calls, the programmer ought really to have said sys.modules['random'].seed() instead of trying a fresh import. But hopefully you will not be troubled long by this behavior, once you have the chance to upgrade to a more recent version of Python.
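
For illustration, the import-free reseeding could look roughly like this. It is a sketch of the idea, not the code that ships in multiprocessing, and reseed_if_loaded is a hypothetical name.

```python
import sys

def reseed_if_loaded():
    """Reseed the stdlib PRNG only if 'random' is already imported.

    The module is looked up in sys.modules directly, so no import
    statement, and therefore no module search, is involved.
    """
    mod = sys.modules.get('random')
    if mod is None:
        return False
    mod.seed()  # no argument: seed from OS entropy
    return True

if __name__ == "__main__":
    import random
    print(reseed_if_loaded())
```

Because this is a plain dictionary lookup, a forked child running it would have no reason to probe the multiprocessing directory for a random module.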