You’re using io_uring, and you want to know whether to use the IOSQE_ASYNC flag.

Read the man pages, they say!

Normal operation for io_uring is to try and issue an sqe as
non-blocking first, and if that fails, execute it in an
async manner. To support more efficient overlapped
operation of requests that the application knows/assumes
will always (or most of the time) block, the application
can ask for an sqe to be issued async from the start. Note
that this flag immediately causes the SQE to be offloaded
to an async helper thread with no initial non-blocking
attempt. This may be less efficient and should not be used
liberally or without understanding the performance and
efficiency tradeoffs.
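
For reference, with liburing the flag is applied per SQE. A minimal sketch, assuming sqe came from io_uring_get_sqe() on an initialized ring:

// Ask the kernel to skip the initial non-blocking attempt and hand this
// SQE straight to an async worker thread.
io_uring_sqe_set_flags(sqe, IOSQE_ASYNC);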

So how do I know if the request will block?

That’s not covered in the man pages.

io_uring under the hood

There’s a very good post (with significant overlap) by Cloudflare about how io_uring works. To summarize:

When an I/O request is submitted, it is first attempted “non-blocking”, or, perhaps better put, inline. If the request can be satisfied immediately, it will be.

If not, then it needs to be completed in one of two ways:

  • via polling (e.g. polling a socket for readiness) -> retry inline
  • blocking (e.g. writing to a block device) on an async worker thread

If we KNOW that the request would not be satisfied immediately, we should instead prepare it for polling or queue it async immediately to save time.
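
To make that concrete, here’s a rough model of the decision in plain C. This is only a sketch of the behaviour described above, not actual kernel code; the function and names are invented for illustration:

#include <errno.h>

enum completion_path { COMPLETED_INLINE, POLL_AND_RETRY, PUNT_TO_WORKER };

// force_async:      the SQE was submitted with IOSQE_ASYNC
// nonblock_ret:     result of the initial non-blocking ("inline") attempt
// file_is_pollable: e.g. a socket or pipe, as opposed to a regular file or block device
static enum completion_path classify(int force_async, int nonblock_ret, int file_is_pollable)
{
    if (force_async)               // IOSQE_ASYNC set: no inline attempt at all
        return PUNT_TO_WORKER;
    if (nonblock_ret != -EAGAIN)   // the non-blocking attempt satisfied the request
        return COMPLETED_INLINE;
    if (file_is_pollable)          // e.g. a socket: arm a poll handler, retry inline on readiness
        return POLL_AND_RETRY;
    return PUNT_TO_WORKER;         // e.g. a block device: hand off to an io-wq worker thread
}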

So how can we tell? It’s not like there’s a programmatic way to access this information.

Testing with Block Devices

Sample Program

Focusing on block devices, let’s launch an AWS EC2 instance on kernel 6.12 with an additional EBS volume to serve as a block device for testing.

[ec2-user@ip-172-31-64-114 ~]$ lsblk
NAME          MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS
nvme0n1       259:0    0   8G  0 disk 
├─nvme0n1p1   259:2    0   8G  0 part /
├─nvme0n1p127 259:3    0   1M  0 part 
└─nvme0n1p128 259:4    0  10M  0 part /boot/efi
nvme1n1       259:1    0   8G  0 disk

We’re going to investigate /dev/nvme1n1. (Or /dev/nvme0n1, if you wanted to corrupt your EC2 instance.)

And let’s write a simple io_uring program to write some data to the block device:

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <fcntl.h>
#include <liburing.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main() {
    // Open with O_DIRECT to bypass kernel page cache
    int fd = open("/dev/nvme1n1", O_WRONLY | O_DIRECT);
    if (fd < 0) {
        perror("open()");
        return 1;
    }

    int block_size; // 512 on this instance
    if (ioctl(fd, BLKSSZGET, &block_size) == -1) {
        perror("ioctl()");
        close(fd);
        return 1;
    }

    struct io_uring ring;
    if (io_uring_queue_init(64, &ring, 0)) {
        perror("io_uring_queue_init()");
        close(fd);
        return 1;
    }

    int rc = 0;

    // Not strictly necessary, but good for performance!
    int r = io_uring_register_files(&ring, &fd, 1);
    close(fd);
    if (r) {
        perror("io_uring_register_files()");
        rc = 1;
        goto end;
    }

    // O_DIRECT writes must be aligned to and have size a multiple of the block size.
    // Let's just write one block.
    void *p;
    r = posix_memalign(&p, block_size, block_size);
    if (r) {
        errno = r;
        perror("posix_memalign()");
        rc = 1;
        goto end;
    }

    // It doesn't matter what data is actually contained in the buffer.

    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_write(sqe, /* registered fd index */ 0, p, block_size, /* offset */ 0);
    io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);

    if (io_uring_submit(&ring) < 0) {
        perror("io_uring_submit()");
        rc = 1;
        goto end;
    }

    struct io_uring_cqe* cqe;
    if (io_uring_wait_cqe(&ring, &cqe)) {
        perror("io_uring_wait_cqe()");
        rc = 1;
        goto end;
    }

    printf("%d\n", cqe->res);

end:
    io_uring_queue_exit(&ring);
    return rc;
}

sudo dnf install gcc liburing-devel for the toolchain, gcc write.c -luring to compile, and then:

[ec2-user@ip-172-31-64-114 ~]$ sudo ./a.out 
512

cool. so…how do we answer our question?

eBPF

eBPF “is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context”. Essentially, this lets us learn about what’s going on in the kernel.

We care about what’s happening in io_uring. We can leverage a tool called bpftrace to easily gain visibility: sudo dnf install bpftrace.

The Linux kernel defines “tracepoints”, which are predetermined locations in kernel code that can be acted upon. io_uring defines many:

[ec2-user@ip-172-31-64-114 ~]$ sudo bpftrace -l | grep tracepoint:io_uring
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_cqe_overflow
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_defer
tracepoint:io_uring:io_uring_fail_link
tracepoint:io_uring:io_uring_file_get
tracepoint:io_uring:io_uring_link
tracepoint:io_uring:io_uring_local_work_run
tracepoint:io_uring:io_uring_poll_arm
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_req_failed
tracepoint:io_uring:io_uring_short_write
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_task_add
tracepoint:io_uring:io_uring_task_work_run

Writing to a block device is blocking I/O. If we hit tracepoint:io_uring:io_uring_queue_async_work, then we know that the request didn’t complete inline.

We can run the following bpftrace script to print io_uring tracepoints: sudo bpftrace -e 'tracepoint:io_uring:* { printf("%s\n", probe) }'

Simultaneously, rerun our program: sudo ./a.out. The bpftrace output should only contain tracepoints hit by our program, since nothing else on this instance is likely to be using io_uring (the only people insane enough to write a program with io_uring would just use userspace NVMe instead).
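
If something else on the instance does happen to use io_uring, you can filter the trace by process name instead (assuming our binary is still named a.out):

sudo bpftrace -e 'tracepoint:io_uring:* /comm == "a.out"/ { printf("%s\n", probe) }'

Either way, the trace for our run looks like this: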

Attaching 17 probes...
tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_task_work_run

We didn’t hit tracepoint:io_uring:io_uring_queue_async_work, so that means the I/O completed inline!

As a sanity check, let’s try with IOSQE_ASYNC:

-io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);
+io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE | IOSQE_ASYNC);

Recompiling and rerunning gives:

tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_task_work_run

Indeed, the I/O completed async!

Therefore, for a write this small, we should not have used IOSQE_ASYNC: the request could have completed inline, and forcing it onto a worker thread only adds overhead.

When is it async?

Surely a write to a block device can’t always complete inline. Let’s write a larger payload (still a multiple of the block size):

+int write_size = 512 * block_size;
 void *p;
-r = posix_memalign(&p, block_size, block_size);
+r = posix_memalign(&p, block_size, write_size);
 if (r) {
     errno = r;
     perror("posix_memalign()");
     rc = 1;
     goto end;
 }
 
 struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
-io_uring_prep_write(sqe, 0, p, block_size, 0);
+io_uring_prep_write(sqe, 0, p, write_size, 0);
 io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);

Running:

tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_task_work_run

Nope, still inline. Let’s try 513? :)

-int write_size = 512 * block_size;
+int write_size = 513 * block_size;

tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_task_work_run

Now it’s async! Is it just a coincidence?

Maximum sector size

We expect the inline attempt to fail if the underlying block device can’t handle the payload in a single request. We can see this constraint by looking at the block device’s parameters:

[ec2-user@ip-172-31-64-114 ~]$ cat /sys/block/nvme1n1/queue/max_sectors_kb
256

256 KiB = 256 * 1024 bytes = 262144 bytes.

512 blocks * 512 bytes per block = 262144 bytes.

Those numbers match up! Hence, the request is punted to the async worker once it exceeds the maximum amount of data the device will accept in a single request.

To verify, let’s reduce this maximum and make our write size 512 * block_size again (which previously completed inline).
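
The limit is exposed as a writable sysfs attribute. The value 128 below is an arbitrary choice; anything smaller than 256 works, since our 512-block write is exactly 256 KiB:

[ec2-user@ip-172-31-64-114 ~]$ echo 128 | sudo tee /sys/block/nvme1n1/queue/max_sectors_kb
128

Recompiling and rerunning gives: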

tracepoint:io_uring:io_uring_create
tracepoint:io_uring:io_uring_register
tracepoint:io_uring:io_uring_submit_req
tracepoint:io_uring:io_uring_queue_async_work
tracepoint:io_uring:io_uring_cqring_wait
tracepoint:io_uring:io_uring_complete
tracepoint:io_uring:io_uring_task_work_run

This time, it completed async!

So, use IOSQE_ASYNC?

Let’s do a heuristic benchmark, with and without IOSQE_ASYNC. We’ll keep the write size at 512 * block_size (which, with max_sectors_kb lowered, gets punted to an async worker either way) and time the submission with the x86 rdtscp instruction for precise measurements.

First, without IOSQE_ASYNC:

+#include <inttypes.h>

 ...

+#include <x86intrin.h>

 ...

 int write_size = 512 * block_size;
 void *p;
 r = posix_memalign(&p, block_size, write_size);
 if (r) {
     errno = r;
     perror("posix_memalign()");
     rc = 1;
     goto end;
 }
 
 struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
 io_uring_prep_write(sqe, 0, p, write_size, 0);
 io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);

+unsigned int aux;
+uint64_t start = __rdtscp(&aux);
 
 if (io_uring_submit(&ring) < 0) {
     perror("io_uring_submit()");
     rc = 1;
     goto end;
 }

 struct io_uring_cqe* cqe;
-if (io_uring_wait_cqe(&ring, &cqe)) {
+r = io_uring_wait_cqe(&ring, &cqe);
+uint64_t end = __rdtscp(&aux);
+if (r) {
     perror("io_uring_wait_cqe()");
     rc = 1;
     goto end;
 }
 
-printf("%d\n", cqe->res);
+printf("%lld\n", end - start);

Running it multiple times with delays to get an average:

[ec2-user@ip-172-31-64-114 ~]$ seq 10 | xargs -I{} sh -c 'sudo ./a.out; sleep 1' | awk '{sum+=$1} END {printf "%.0f\n", sum/NR}'
5506019

And now, let’s enable IOSQE_ASYNC.

-io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE);
+io_uring_sqe_set_flags(sqe, IOSQE_FIXED_FILE | IOSQE_ASYNC);
[ec2-user@ip-172-31-64-114 ~]$ seq 10 | xargs -I{} sh -c 'sudo ./a.out; sleep 1' | awk '{sum+=$1} END {printf "%.0f\n", sum/NR}'
5469197

5506019 cycles - 5469197 cycles = 36822 cycles. On this 2.5 GHz processor, that’s roughly 14.7 microseconds saved per write. What would you do with all that time!

(Obviously, this benchmark is noisy and highly sensitive to jitter. But repeated attempts do show that setting IOSQE_ASYNC marginally reduces execution time for this workload.)

Conclusion

Whether fine-grained tuning knobs like IOSQE_ASYNC pay off really depends on your exact use case. Sometimes, you need to reach deep into the kernel to understand your exact workload!