When doing 1M+ IOPS, you probably want to skip the OS IO schedulers due to their (timer & spinlock) overhead [1] and let the hardware take care of any scheduling in its device queues. But you're right about flattening the IO burst spikes via DB configuration, so that you'd have constant, slow checkpointing going on instead of a huge spike every 15 minutes...
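For example, on Linux you can check and switch the scheduler per block device via sysfs - a minimal sketch, with nvme0n1 as an example device name:

  # show the current scheduler; modern kernels default NVMe devices to "none"
  cat /sys/block/nvme0n1/queue/scheduler
  # bypass OS-level I/O scheduling entirely (needs root)
  echo none > /sys/block/nvme0n1/queue/scheduler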
All this depends on what kind of storage backend you're on: local consumer SSDs with just one NVMe namespace each, local SSDs with multiple namespaces (each with their own queues), or a full-blown enterprise storage backend where you have no idea what's really going on in the backend :-)
Edit: Note that I wasn't proposing using an entire physical disk device (or multiple devices) for the low-latency files, but just a part of one. Local enterprise-grade SSDs support multiple namespaces (with their own internal queues), so you can carve out just 1% of a drive for separate I/O processing. And this works with enterprise SAN arrays (or cloud elastic block store offerings) too: you don't know how many physical disks are involved in the backend anyway, but at your host OS level you get a separate IO queue that is not gonna be full of thousands of checkpoint writes.

[1]: https://tanelpoder.com/posts/11m-iops-with-10-ssds-on-amd-th...
> local enterprise-grade SSDs support multiple namespaces (with their own internal queues)
What do you mean by namespaces here? Are they created by having different partitions or LVM volumes? Since you mentioned that consumer-grade SSDs only have a single namespace, I'm guessing this is something that needs some config when mounting the drive?
With SSDs that support namespaces, you can use commands like "nvme create-ns" to logically "partition" the underlying device, so you'll end up with device names like this (also in my blog above; rough commands below):
/dev/nvme0n1
/dev/nvme0n2
/dev/nvme0n3
...
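The nvme-cli steps look roughly like this - a sketch only, the sizes and IDs here are illustrative and the exact procedure varies by drive:

  # sizes (--nsze/--ncap) are in logical blocks, --flbas picks the LBA format
  nvme create-ns /dev/nvme0 --nsze=97656250 --ncap=97656250 --flbas=0
  # attach the new namespace to the controller (ID from "nvme id-ctrl /dev/nvme0")
  nvme attach-ns /dev/nvme0 --namespace-id=2 --controllers=0x1
  # rescan so the new /dev/nvme0n2 device node shows up
  nvme ns-rescan /dev/nvme0
  nvme list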
Consumer disks support only a single namespace, as far as I've seen. Different namespaces give you extra flexibility (I think some drives even support different sector sizes for different namespaces).
So under the hood you'd still be using the same NAND storage, but the controller can now process incoming I/Os with awareness of which "logical device" they came from. Even if your data volume has managed to submit a burst of 1000 in-flight I/O requests via its namespace, the controller can still pick up the latest I/Os from the other (redo volume) namespace and serve them as well, without having to drain the data volume's burst first.
So, you can create a high-priority queue by using multiple namespaces on the same device. It's like logical partitioning of the SSD's I/O handling capability, not physical partitioning of disk space like OS-level "fdisk" partitioning would be. OS "fdisk" partitioning and LVM mapping are not related to NVMe namespaces at all.
Also, I'm not an NVMe SSD expert, but this is my understanding, and my test results agree so far.
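A quick way to see the effect yourself is something like the fio runs below - just a sketch, reusing the example device names from above (and note that raw-device writes like this destroy any data on the namespace):

  # terminal 1: flood the "data" namespace with a deep async write burst
  fio --name=burst --filename=/dev/nvme0n1 --rw=randwrite --bs=8k \
      --iodepth=256 --ioengine=libaio --direct=1 --time_based --runtime=60
  # terminal 2: measure small random read latency on the "redo" namespace
  fio --name=lat --filename=/dev/nvme0n2 --rw=randread --bs=4k \
      --iodepth=1 --ioengine=libaio --direct=1 --time_based --runtime=60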
Ah ok - so googling a bit on this, you do specify the size when creating the namespace. So if you have multiple namespaces, they appear as separate devices on the OS, and then you can mkfs and mount each as if it's a different disk. Then you get the different IO queues at the hardware level, unlike with traditional partitioning.
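So presumably something like this - a guess on my part, device name and mount point made up:

  mkfs.xfs /dev/nvme0n2
  mount /dev/nvme0n2 /mnt/redo
  # the namespace is a top-level block device with its own request queue
  ls /sys/block/nvme0n2/queue/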
Yep, exactly - with OS level partitioning or logical volumes, you'd still end up with a single underlying block device (and a single queue) at the end of the day.
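You can see this in sysfs (device names are examples again): a partition shows up nested under its parent device and has no queue of its own, while each namespace is a top-level block device with its own queue directory:

  # partition: nested under the parent, no queue/ directory of its own
  ls /sys/block/nvme0n1/nvme0n1p1/
  # namespaces: each top-level device gets its own queue/
  ls /sys/block/nvme0n1/queue/ /sys/block/nvme0n2/queue/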