Search code examples
pythonpython-3.xpython-asynciostat

How can I asyncio schedule a filesystem stat operation?


Converting some code to using asyncio, I'd like to give back control to the asyncio.BaseEventLoop as quickly as possible. This means to avoid blocking waits.

Without asyncio I'd use os.stat() or pathlib.Path.stat() to obtain e.g. the filesize. Is there a way to do this efficiently with asyncio?

Can I just wrap the stat() call so it is a future similar to what's described here?


Solution

  • os.stat() translates to a stat syscall:

    $ strace python3 -c 'import os; os.stat("/")'
    [...]
    stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
    [...]
    

    which is blocking, and there's no way to get a non-blocking stat syscall.

    asyncio provides non-blocking I/O by using non-blocking system calls, which already exists (see man fcntl, with its O_NONBLOCK flag, or ioctl), so asyncio is not making syscalls asynchronous, it exposes already asynchronous syscalls in a nice way.

    It's still possible to use the nice ThreadPoolExecutor abstraction to make your blocking stat calls in parallel using a pool of threads.

    But you may first consider some other parameters:

    • According to strace -T, stat is fast: stat("/", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0 <0.000007>, probably faster than starting and synchronizing threads.
    • stat is probably in much cases IO bound, so using more CPUs won't help
    • Doing parallel I/O may break a nice sequential access to a random access, phisical hard drive may be slower in this context.

    But there's also a lot of possibilities for your stats to be faster using a thread pool, like if you're hitting a distributed file system.

    You may also take a look at functools.lru_cache: if you're doing multiple stat on the same file or directory, and you're sure it has not changed, caching the result avoids a syscall.

    To conclude, "keep it simple", "os.stat" is the efficient way to get a filesize.