Tags: linux, linux-kernel, aio

How do you determine which process is using up Linux aio context capacity?


In Linux, reading /proc/sys/fs/aio-nr returns the total number of events allocated across all active aio contexts in the system. The maximum value is controlled by /proc/sys/fs/aio-max-nr.
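For reference, here is a minimal sketch of reading both counters and printing the remaining headroom. The function name `aio_usage` and its optional directory argument are assumptions made for illustration (the argument only exists so the function can be pointed at test files); by default it reads the real files under /proc/sys/fs.

```shell
# Print system-wide AIO usage: events currently allocated vs. the limit.
# aio_usage and its optional directory argument are hypothetical helpers,
# not part of any standard tool; the default is the real /proc/sys/fs.
aio_usage() {
    dir="${1:-/proc/sys/fs}"
    nr=$(cat "$dir/aio-nr") || return 1        # events allocated right now
    max=$(cat "$dir/aio-max-nr") || return 1   # system-wide ceiling
    echo "aio-nr=$nr aio-max-nr=$max free=$((max - nr))"
}
```

Calling `aio_usage` with no arguments reports the live system values. It only gives the system-wide total, which is exactly the limitation the question is about.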

Is there a way to tell which process is responsible for allocating these aio contexts?


Solution

  • There isn't a simple way, or at least not one that I've ever found. However, you can watch aio contexts being allocated and freed using SystemTap.

    https://blog.pythian.com/troubleshooting-ora-27090-async-io-errors/

    Attempting to execute the complete script in that article produced errors on my CentOS 7 system. However, the first part of it, which logs allocations, may give you enough insight on its own:

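    stap -ve '
    # Per-PID totals: events requested (allocatedctx) and io_setup() calls (allocated)
    global allocated, allocatedctx
    
    # Fires on each io_setup() call; maxevents is the requested context size
    probe syscall.io_setup {
      allocatedctx[pid()] += maxevents; allocated[pid()]++;
      printf("%d AIO events requested by PID %d (%s)\n",
        maxevents, pid(), cmdline_str());
    }
    '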
    

    SystemTap has to be running before your workload kicks in, so you'll need to coordinate things accordingly.

    Install SystemTap, then execute the above command. (Note: I've altered it slightly from the linked article to remove the unused freed symbol.) After a few seconds, it'll be running. Then, start your workload.

    Pass 1: parsed user script and 469 library scripts using 227564virt/43820res/6460shr/37524data kb, in 260usr/10sys/263real ms.
    Pass 2: analyzed script: 5 probes, 14 functions, 101 embeds, 4 globals using 232632virt/51468res/11140shr/40492data kb, in 80usr/150sys/240real ms.
    Missing separate debuginfos, use: debuginfo-install kernel-lt-4.4.70-1.el7.elrepo.x86_64
    Pass 3: using cached /root/.systemtap/cache/55/stap_5528efa47c2ab60ad2da410ce58a86fc_66261.c
    Pass 4: using cached /root/.systemtap/cache/55/stap_5528efa47c2ab60ad2da410ce58a86fc_66261.ko
    Pass 5: starting run.
    

    Then, once your workload starts, you'll see the context requests logged:

    128 AIO events requested by PID 28716 (/Users/blah/awesomeprog)
    128 AIO events requested by PID 28716 (/Users/blah/awesomeprog)
    

    So, not as simple as lsof, but I think it's all we have!
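If the workload makes many io_setup calls, the logged lines shown above can be totalled per process. The following is a rough sketch, assuming the log lines keep exactly the format printed by the probe ("N AIO events requested by PID P (...)"). The function name `summarize_aio` is hypothetical.

```shell
# Sum requested AIO events per PID from the probe's log lines on stdin.
# Output columns: PID, total events requested, number of io_setup() calls,
# sorted by total events in descending order.
# summarize_aio is a hypothetical helper name, not part of SystemTap.
summarize_aio() {
    awk '$2 == "AIO" && $6 == "PID" {
             events[$7] += $1    # field 1 is maxevents, field 7 is the PID
             calls[$7]++
         }
         END {
             for (pid in events)
                 printf "%s %d %d\n", pid, events[pid], calls[pid]
         }' | sort -k2,2nr
}
```

For example, if the stap output was saved to a file called aio.log (a name chosen here for illustration), `summarize_aio < aio.log` prints one line per PID. Lines that don't match the probe's format, such as the "Pass 5: starting run." banner, are ignored. Note that this counts requests only: contexts that were later destroyed are still included in the totals.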