c, linux, io, linux-kernel, aio

How does the Linux kernel handle Asynchronous I/O (AIO) requests?


I am writing a C program to read data from an SSD drive by reading directly from the raw block device file.

I am trying out Linux AIO (I mean the Linux AIO API, i.e. the functions provided by linuxaio.h, such as io_submit(...), not the POSIX AIO API). I open the block device file with the O_DIRECT flag and make sure that the buffers I read into are aligned to the block size.

I noticed that Linux AIO is considerably faster than synchronous I/O, even when the synchronous I/O also uses the O_DIRECT flag.

What surprised me most is that the throughput achieved by issuing many small random reads of a few KB each with Linux AIO is remarkably higher even than the throughput achieved by a large (sequential) read of a few MB using synchronous I/O and O_DIRECT.

So, I would like to know: why does Linux AIO perform so much better than synchronous I/O? What does the kernel do when AIO is used? Does the kernel perform request reordering? Does using Linux AIO result in greater CPU utilization than using synchronous I/O?

Thanks a lot in advance


Solution

  • Short answer: Most likely the AIO implementation is "faster" because it submits multiple IOs in parallel, while the synchronous implementation has either zero or one I/O in flight. It has nothing to do with writing to memory or with the kernel I/O path having additional overhead for synchronous I/Os.

    You can check this using iostat -x -d 1. Look at avgqu-sz (average queue size = the average number of in-flight I/Os; newer sysstat releases label this column aqu-sz) and %util (utilization = the percentage of time during which the device had at least one I/O issued to it).
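
    If you want a similar number from inside a program, the kernel exposes the per-device counters that iostat reads in /proc/diskstats. The following is a minimal sketch rather than anything from the question: the function names parse_inflight and read_inflight are made up for illustration, and it samples the instantaneous in-flight count (the ninth statistics field) instead of the interval average that avgqu-sz shows.

    ```c
    #include <stdio.h>
    #include <string.h>

    /* Parse one /proc/diskstats line. On success, copy the device name into
     * `name`, store the "I/Os currently in progress" field in `inflight`,
     * and return 0. Return -1 if the line does not have the expected shape. */
    int parse_inflight(const char *line, char *name, size_t name_len,
                       unsigned long *inflight)
    {
        unsigned int major, minor;
        unsigned long f[9];
        char dev[64];

        /* Layout: major, minor, device name, then the statistics fields.
         * The ninth statistics field is the number of I/Os in flight. */
        int n = sscanf(line, "%u %u %63s %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                       &major, &minor, dev,
                       &f[0], &f[1], &f[2], &f[3], &f[4],
                       &f[5], &f[6], &f[7], &f[8]);
        if (n != 12)
            return -1;
        snprintf(name, name_len, "%s", dev);
        *inflight = f[8];
        return 0;
    }

    /* Return the current in-flight count for `device` (a name such as "sda",
     * which is only an example), or -1 if the device cannot be found. */
    long read_inflight(const char *device)
    {
        char line[512], name[64];
        unsigned long inflight;
        long result = -1;
        FILE *fp = fopen("/proc/diskstats", "r");

        if (!fp)
            return -1;
        while (fgets(line, sizeof line, fp)) {
            if (parse_inflight(line, name, sizeof name, &inflight) == 0 &&
                strcmp(name, device) == 0) {
                result = (long)inflight;
                break;
            }
        }
        fclose(fp);
        return result;
    }
    ```

    Sampling read_inflight() in a loop while the benchmark runs should show values of 0 or 1 for the synchronous version and noticeably larger values for the AIO version, if the explanation above applies to your case.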

    Long answer:

    • The concept of "faster" is tricky when talking about I/O. Does "faster" mean higher bandwidth? Or lower latency? Or bandwidth at a given request size? Or latency at a given queue depth? Or a combination of latency, bandwidth, request size, queue depth, and the many other parameters of the workload? I assume here that you are talking about throughput/bandwidth; however, it is good to remember that the performance of a storage device is not a single-dimension metric.

    • SSDs are highly parallel devices. An SSD is composed of many flash chips, each chip having multiple dies that can read/write independently. SSDs take advantage of this and perform many I/Os in parallel, without a noticeable increase in response time. Therefore, in terms of throughput, it matters a lot how many concurrent I/Os the SSD sees.

    • Let's look at what happens when a thread submits a synchronous I/O: a) the thread spends some CPU cycles preparing the I/O request (generate data, compute offset, copy data into buffer, etc.), b) the system call is performed (e.g. pread()), execution passes to kernel space, and the thread blocks, c) the I/O request is processed by the kernel and traverses the various kernel I/O layers, d) the I/O request is submitted to the device and traverses the interconnect (e.g. PCIe), e) the I/O request is processed by the SSD firmware, f) the actual read command is sent to the appropriate flash chip, g) the SSD controller waits for the data, h) the SSD controller gets the data from the flash chip and sends it through the interconnect. At this point the data leaves the SSD and stages e) through a) happen in reverse.

    • As you can see, the synchronous I/O process is playing request ping-pong with the SSD. During many of the stages described above no data is actually read from the flash chips. On top of this, although your SSD can process tens to hundreds of requests in parallel, it sees at most one request at any given moment. Throughput is therefore very low, because most of the SSD's internal parallelism sits idle.
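
      The ping-pong pattern described above looks roughly like this in code. This is a minimal sketch under stated assumptions: sync_read_blocks is a hypothetical helper name, fd is assumed to be already open (with O_DIRECT in your case), and buf is assumed to be suitably aligned.

      ```c
      #include <sys/types.h>
      #include <unistd.h>

      /* Read n blocks of blk bytes each, one after another, into consecutive
       * slots of buf. Returns the total number of bytes read, or -1 on error. */
      ssize_t sync_read_blocks(int fd, char *buf, size_t blk,
                               const off_t *offs, size_t n)
      {
          ssize_t total = 0;

          for (size_t i = 0; i < n; i++) {
              /* pread() does not return until this one request has gone all
               * the way to the device and back, so this thread never has
               * more than one I/O in flight. */
              ssize_t got = pread(fd, buf + i * blk, blk, offs[i]);
              if (got < 0)
                  return -1;
              total += got;
          }
          return total;
      }
      ```

      Each loop iteration pays the full round-trip latency before the next request is even prepared, which is why the queue depth seen by the device stays at zero or one.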

    • Asynchronous I/O helps in two ways: a) it allows the process to submit multiple I/O requests in parallel (the SSD has enough work to keep busy), and b) it allows pipelining I/Os through the various processing stages (therefore decoupling stage latency from throughput).
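
      For comparison, here is a sketch of submitting a whole batch at once with Linux AIO. To avoid depending on a wrapper library it calls the raw system calls through syscall(2) with the structures from <linux/aio_abi.h>; your program may well use wrapper functions instead. The helper name aio_read_blocks and the MAX_BATCH limit are assumptions made for this example, and fd and buf are assumed to be set up as in your program (O_DIRECT, aligned buffer).

      ```c
      #define _GNU_SOURCE
      #include <linux/aio_abi.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      #define MAX_BATCH 64  /* arbitrary limit for this sketch */

      /* Submit n reads of blk bytes in one io_submit call, then wait for all
       * of them. Returns the total number of bytes read, or -1 on error. */
      long aio_read_blocks(int fd, char *buf, size_t blk,
                           const long long *offs, unsigned n)
      {
          aio_context_t ctx = 0;
          struct iocb cbs[MAX_BATCH];
          struct iocb *ptrs[MAX_BATCH];
          struct io_event events[MAX_BATCH];
          long total = 0, done = 0;

          if (n == 0 || n > MAX_BATCH)
              return -1;
          if (syscall(SYS_io_setup, (long)n, &ctx) < 0)
              return -1;

          memset(cbs, 0, sizeof cbs);
          for (unsigned i = 0; i < n; i++) {
              cbs[i].aio_lio_opcode = IOCB_CMD_PREAD;
              cbs[i].aio_fildes = (uint32_t)fd;
              cbs[i].aio_buf = (uint64_t)(uintptr_t)(buf + (size_t)i * blk);
              cbs[i].aio_nbytes = blk;
              cbs[i].aio_offset = offs[i];
              cbs[i].aio_data = i;  /* echoed back in the completion event */
              ptrs[i] = &cbs[i];
          }

          /* All requests are handed to the kernel before any result is
           * awaited. io_submit may accept fewer than n requests. */
          long submitted = syscall(SYS_io_submit, ctx, (long)n, ptrs);
          if (submitted < 0) {
              syscall(SYS_io_destroy, ctx);
              return -1;
          }

          while (done < submitted) {
              long got = syscall(SYS_io_getevents, ctx, 1L, submitted - done,
                                 events, NULL);
              if (got < 0) {
                  total = -1;
                  break;
              }
              for (long i = 0; i < got; i++) {
                  if (events[i].res < 0)
                      total = -1;
                  else if (total >= 0)
                      total += (long)events[i].res;
              }
              done += got;
          }
          syscall(SYS_io_destroy, ctx);
          return total;
      }
      ```

      The difference from the synchronous loop is not a cheaper kernel path but the fact that up to n requests are outstanding between io_submit and io_getevents. Note that without O_DIRECT the reads may be carried out synchronously inside io_submit, so the overlap only materializes in a setup like yours.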

    • You see asynchronous I/O being faster than synchronous I/O because you are comparing apples and oranges. The synchronous throughput is at a given request size, low queue depth, and without pipelining. The asynchronous throughput is at a different request size, higher queue depth, and with pipelining. The numbers you saw are not comparable.

    • The majority of I/O-intensive applications (e.g. databases, webservers, etc.) have many threads that perform synchronous I/O. Although each thread can submit at most one I/O at any given moment, the kernel and the SSD see many I/O requests that can be served in parallel. Multiple sync I/O requests result in the same benefits as multiple async I/O requests.

      The main differences between asynchronous and synchronous I/O come down to how I/O and processes are scheduled and to the programming model. Both async and sync I/O can squeeze the same IOPS/throughput from a storage device if done right.
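
      A rough way to see how much queue depth alone changes the result is Little's law: the number of requests in flight equals the completion rate times the per-request latency. The sketch below turns that into a throughput estimate. The function name is hypothetical, and the figures in the usage note are illustrative, not measurements of any real device.

      ```c
      /* Little's law: queue_depth = (requests per second) * (latency in s),
       * so bytes per second = queue_depth * request_bytes / latency.
       * With the latency given in microseconds, bytes per microsecond is
       * numerically the same as MB/s (1 MB = 10^6 bytes). */
      double estimated_mb_per_s(unsigned queue_depth, double request_bytes,
                                double latency_us)
      {
          return queue_depth * request_bytes / latency_us;
      }
      ```

      Assuming, purely for illustration, 4096-byte reads with a 100 µs round trip, a queue depth of 1 gives about 41 MB/s, while a queue depth of 32 gives about 1311 MB/s, provided the device can keep latency flat at that depth. The same device produces both numbers, which is why throughput figures taken at different queue depths and request sizes cannot be compared directly.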
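
      The multi-threaded synchronous pattern mentioned above can be sketched with POSIX threads. This is an illustration only: threaded_read_blocks, struct read_job and MAX_THREADS are hypothetical names, one thread per request is used for brevity where a real application would more likely keep a fixed pool of worker threads, and fd and buf are assumed to be set up as in your program.

      ```c
      #include <pthread.h>
      #include <sys/types.h>
      #include <unistd.h>

      #define MAX_THREADS 64  /* arbitrary limit for this sketch */

      struct read_job {
          int fd;
          char *dst;
          size_t len;
          off_t off;
          ssize_t got;
      };

      /* Each worker issues one ordinary blocking pread(). */
      static void *read_worker(void *arg)
      {
          struct read_job *job = arg;
          job->got = pread(job->fd, job->dst, job->len, job->off);
          return NULL;
      }

      /* Read n blocks of blk bytes using one thread per block. Returns the
       * total number of bytes read, or -1 on error. */
      ssize_t threaded_read_blocks(int fd, char *buf, size_t blk,
                                   const off_t *offs, size_t n)
      {
          pthread_t tids[MAX_THREADS];
          struct read_job jobs[MAX_THREADS];
          size_t started = 0;
          ssize_t total = 0;

          if (n > MAX_THREADS)
              return -1;
          for (; started < n; started++) {
              jobs[started] = (struct read_job){
                  fd, buf + started * blk, blk, offs[started], -1
              };
              if (pthread_create(&tids[started], NULL, read_worker,
                                 &jobs[started]) != 0)
                  break;
          }
          /* While the threads are blocked in pread(), the kernel and the
           * device see up to `started` requests at the same time. */
          for (size_t i = 0; i < started; i++) {
              pthread_join(tids[i], NULL);
              if (jobs[i].got < 0)
                  total = -1;
              else if (total >= 0)
                  total += jobs[i].got;
          }
          return started == n ? total : -1;
      }
      ```

      Sharing one file descriptor between the threads is fine here because pread() takes an explicit offset and does not touch the file position. On older glibc versions the program has to be built with -pthread.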