
I wrote an extremely fast hot polled pipe in Rust for QEMU instrumentation. I’m sure there’s room to improve but it’s effectively bottlenecking on the uarch. https://github.com/MarginResearch/cannoli/blob/main/mempipe/...

This was specifically designed for one producer (the QEMU processor) and many consumers (processing the trace of instructions and memory accesses). I can't remember the specific tunings I did to help with this model. Data structures like this can always be tuned slightly for your workload, by changing structure shapes, shared cache lines, etc.

With about 3-4 consumers, QEMU was not really blocking on the reporting of every single instruction executed, which is really cool. This requires a low-noise system; just having a browser open can almost halve these numbers, since there's simply more cache-coherency bus traffic occurring. If having a browser open doesn't affect your benchmark, it's probably not bottlenecking on the uarch yet ;)

https://github.com/MarginResearch/cannoli/blob/main/.assets/...

A little blog on Cannoli itself here: https://margin.re/2022/05/cannoli-the-fast-qemu-tracer/

Ultimately, mempipe isn't something I've really talked about, but it's what makes Cannoli so fast and enabled the design in the first place.



Here's the mempipe benchmark latency for core<->core: https://github.com/MarginResearch/cannoli/blob/main/mempipe/...

https://raw.githubusercontent.com/MarginResearch/cannoli/mai...

You can see a massive improvement for shared hyperthreads, and on-CPU-socket messages.

In this case: ~350 cycles to a local core, ~700 cycles to a remote core, and ~90 cycles to the sibling hyperthread. Divide these by your clock rate to get the time in seconds (valid on Skylake+, where the TSC runs at a fixed frequency), e.g. ~87.5 nanoseconds for local-core IPC on a 4 GHz processor.
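For reference, the cycles-to-time arithmetic is just a division by the clock frequency. A trivial sketch (the helper name and values here are my own, taken from the numbers above):

```rust
/// Convert a cycle count to nanoseconds, assuming a fixed TSC frequency
/// (true on Skylake+, where the timestamp counter ticks at a constant rate
/// regardless of turbo/throttling).
fn cycles_to_ns(cycles: f64, clock_ghz: f64) -> f64 {
    // cycles / (cycles per nanosecond) = nanoseconds
    cycles / clock_ghz
}

fn main() {
    // ~350 cycles of local-core latency at 4 GHz -> 87.5 ns
    println!("{} ns", cycles_to_ns(350.0, 4.0));
}
```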


Anyways, since you expressed disappointment in the ~1,000-cycle cost: that's about right. The latency between cores is actually quite high and there's not much you can do about it, especially on a system like x86, which has extremely strong cache coherency by default. One thing that is really important to understand is that the ownership of cache lines dramatically affects the cost of memory accesses.

For IPC, this effectively requires one thread writing to memory (thus making that cache line modified on that core, and evicted from all other cores). Then, when the polling thread checks in on the line, it has to demote that cache line from modified to shared by flushing it out (usually to L2 or L3, but also out to memory). This causes memory traffic, and it means the cores are constantly fighting over the same cache line. Since x86 is strongly ordered and caches are coherent, this traffic is extremely expensive.

Think of a write as "tell all other cores that you modified this memory, so they have to evict/invalidate their cache lines", and a read as "tell all cores to flush their cache lines if they're modified, wait for them to tell me they're done, then read the memory". This is effectively a massively blocking operation. The simple act of reading the mailbox/ticket/whatever from another core to check if a message is ready will dramatically slow down the other core's writes to it (as each write is now effectively full latency).
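To make the pattern concrete, here's a deliberately naive polled mailbox of my own construction (not mempipe's design): the ready flag and payload sit on the same line, so every handoff ping-pongs that cache line between the two cores, exactly the modified/shared traffic described above.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Naive polled mailbox: flag and payload share a cache line, so the
// producer and consumer constantly steal the line from each other.
struct Mailbox {
    ready: AtomicBool,
    payload: AtomicU64,
}

/// Send 1..=n through the mailbox one message at a time; return the sum
/// the consumer observed.
fn run_mailbox(n: u64) -> u64 {
    let mb = Arc::new(Mailbox {
        ready: AtomicBool::new(false),
        payload: AtomicU64::new(0),
    });

    let producer = {
        let mb = Arc::clone(&mb);
        thread::spawn(move || {
            for i in 1..=n {
                // Spin until the consumer has drained the previous message.
                while mb.ready.load(Ordering::Acquire) {
                    std::hint::spin_loop();
                }
                mb.payload.store(i, Ordering::Relaxed);
                // This store invalidates the line in the consumer's cache;
                // the consumer's next poll pays the full cross-core latency.
                mb.ready.store(true, Ordering::Release);
            }
        })
    };

    let mut sum = 0;
    for _ in 0..n {
        while !mb.ready.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        sum += mb.payload.load(Ordering::Relaxed);
        mb.ready.store(false, Ordering::Release);
    }
    producer.join().unwrap();
    sum
}

fn main() {
    let n = 10_000;
    assert_eq!(run_mailbox(n), n * (n + 1) / 2);
}
```

Every round trip here pays the full coherency cost twice (publish, then drain), which is why per-message signaling like this lands in the hundreds of cycles per item.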

There are some tricks you can do to get extremely low latency between cores. One of them, is making sure you're on cores that are physically near each other (eg. on the same processor socket). This is only really relevant on servers, but it's a big thing. You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

Another trick you can do is put both threads on the same core, i.e. on sibling hyperthreads. Hyperthreads share the same core and thus a lot of resources, and they can skip some of the more expensive coherency traffic, since they share the same L1 cache (L1 is per-core). The lowest IPC latency you'll observe is between hyperthreads on the same core, but that's often not really useful for _processing_ the data, since performance won't be great with two busy threads sharing a core. In theory, though, you can signal a hyperthread, have it go raise some other signal, while the original hyperthread continues doing relevant work. As long as one of them is blocking/halted, the other won't really be affected by sharing the core.

Finally, the most reasonable trick is making sure your tickets/buffers/mailboxes/whatever are _not_ sharing the same cache lines (unless they contain data which is passed at the same time). Once again, the CPU keeps things in sync at cache-line granularity, so having two pieces of data hammered by two cores on the same cache line is asking for hundreds of cycles per trivial data access. This can be observed in an extreme case on many-core, multi-socket systems fighting over locks. I've done this on my 96C/192T system and been able to get single `lock inc [mem]` instructions to take over 15,000 cycles to complete, which is unreal for a single instruction. But that's what happens when there's 200-500 cycles of overhead every single time that cache line is "stolen" back from other cores.

So, effectively, keep in your head which state each cache line will be in. If it's going to be modified on one core, make sure it's not going to be read on another core while still being written. These transitions are expensive: you only get your 3-4-cycle "random L1 data hit" performance if a line being read is in the exclusive, modified, or shared state, and a line being written is exclusive or modified. Anything else and you're probably paying hundreds of cycles for the access, and thus also probably hurting the other side.
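The standard way to express "keep these off the same line" in Rust is an alignment wrapper, something like the sketch below (my own names, assuming 64-byte lines, which holds on current x86):

```rust
use std::sync::atomic::AtomicU64;

// Force a value onto its own 64-byte cache line. Each hot field padded
// this way can be hammered by one core without evicting the other's line.
#[repr(align(64))]
struct CacheLine<T>(T);

// Hypothetical channel header: each side only ever writes its own field,
// and the alignment guarantees the two fields never share a line.
struct Channel {
    producer_seq: CacheLine<AtomicU64>, // written by the producer only
    consumer_seq: CacheLine<AtomicU64>, // written by the consumer only
}

fn main() {
    assert_eq!(std::mem::align_of::<CacheLine<AtomicU64>>(), 64);
    assert_eq!(std::mem::size_of::<CacheLine<AtomicU64>>(), 64);
    // Two 64-byte-aligned fields: the struct spans two distinct lines.
    assert_eq!(std::mem::size_of::<Channel>(), 128);
}
```

The cost is wasted space (64 bytes to hold an 8-byte counter), which is exactly the trade: you spend cache footprint to avoid the coherency ping-pong.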

Ultimately, what you're asking from the CPU is actually extremely complex. Think about how hard it would be for you to keep a copy of a database in sync between hundreds of writers and readers (cores). The CPU is doing this automatically for you under the hood, on every single memory access. It is _not_ free. Thus, you really have to engineer around this problem: batch your operations, find a design that doesn't require such intense IPC, etc. On more weakly ordered systems you can use some more tricks in the page tables to get a bit more explicit control over how cache coherency is applied to various chunks of memory.


Oh sliding in here late, but it’s also extremely important to pin your threads to specific cores. I kinda only think about computers in this mode so I didn’t bring it up, but the kernel will effectively randomly schedule your process to different physical cores. If you are doing intense IPC or otherwise relying on specific cache usage, getting assigned to a different core is a massive loss of existing state which takes time to recover from!


> You can actually map out the physical processor layout, including on a single die, based on the latency between these cores. It's quite subtle and requires low noise, but it's really cool to map out the grid of cores on the actual silicon due to timing.

This is a very cool comment in general, but I'm intrigued by this bit in particular. I'd love to see an implementation, if anyone knows of any.


There’s one big public work by utexas that covers this in a few places! https://sites.utexas.edu/jdm4372/2021/05/27/locations-of-cor...

See additional references for PDFs.

I’ve also played around with this in my research kernel, but never thoroughly enough to write up. Something I should revisit; I think it’d be a really fun thing to discuss and work on. Doing this level of timing requires BIOS configuration to make the CPU as deterministic as possible (turn off dynamic throttling, make sure the CPU is thermally stable, etc.).

I always highly encourage people to write advanced benchmarks. It’s a great way to learn how computers work!


Sticking with x86, I'm pretty sure CPUID can tell you the topology of all the cores you can query. Not that I'd tell anyone not to infer it from timing; that just sounds like fun.


It can tell you the specified topology: cores, threads, NUMA nodes. But it can’t tell you the physical locations of cores. Processors are binned based on which cores are functional after fabrication, so your 12-core processor is probably a 16-18-core die, and the disabled cores are “random” based on manufacturing defects. Knowing this is completely unimportant, but cool. Ultimately, yes, cpuid will tell you any relevant topology, since anything beyond that likely requires trillions of iterations to even detect.


It’s common in Anandtech benchmarks, eg https://www.anandtech.com/show/15578/cloud-clash-amazon-grav...

Dunno what software they use…



You mention the most reasonable trick is to just avoid hammering/read-write the same cache lines; I guess you didn't hit this as a need because your QEMU instruction pipe was fast enough, but would you batch up your events along cache lines, fill up a line, and signal it's free to the consumers instead?


Yeah. So I forget the original tuning I did for that project. But, I fill up a buffer which is on its own cache line (Chunk) and then signal that the chunk is ready for ownership on the other side, thus sending it. I’m not sure why the signaling atomics aren’t on their own cache lines, I imagine I tried both and this was faster? There’s also a chance I never tried it because I felt I didn’t need it? I’m not sure!

Edit: Yep, looking at this I think I see some room for improvement in the design. I'm getting about 60-70 ns/iter right now on my Intel(R) Core(TM) i9-9900KS CPU @ 4.00GHz with turbo on and a loaded-up system. This code is 2 years old and I've gotten a lot better at designing stuff like this. But it's still really good, and something I'd like to revisit at some point.

The main bottleneck I think I see is that `cur_seq` should be on its own cache line, as that's the only heavily thrashed value between cores. Most of the others stay primarily in the shared cache-line state until a writer hits a mailbox. This design searches linearly through mailboxes for work to do; perhaps it would be faster to save the previous index? I'm also not sure I always need readers to be strongly sequenced, such that they can process more than one message at a time, but this could have been a requirement of the way I was using it?

While in theory storing/caching some information is great, in practice it often means unpredictable and dependent loads to use it. E.g., storing the previous index now means the CPU has to fetch the previous index from memory and then loop on it. The compiler also has to treat this as a "volatile" value, and thus will struggle to do things like loop unrolling and bounds-check removal. With a low N (~4-16 mailboxes), it's probably always faster to just `for _ in 0..16` instead of doing smarts here, as the processor can do all 16 of those effectively in parallel by the time it has even loaded the "cached index". Once again, tuning is extremely specific to a workload and its parameters.
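The "just sweep all N" point looks something like this sketch (my own simplified bank of flags, not mempipe's actual layout): with a small fixed bound, the loads are independent and the CPU can issue them all in parallel, whereas a remembered index is a dependent load it has to wait on.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const NUM_MAILBOXES: usize = 16;

// A small fixed bank of mailbox-ownership flags (1 = message pending).
// With N this small, an unconditional linear sweep tends to beat any
// "remember where I left off" cleverness.
struct MailboxBank {
    owned: [AtomicU8; NUM_MAILBOXES],
}

impl MailboxBank {
    fn new() -> Self {
        Self { owned: std::array::from_fn(|_| AtomicU8::new(0)) }
    }

    /// Scan every mailbox and return the first one with pending work.
    /// The fixed trip count lets the compiler fully unroll the loop and
    /// drop bounds checks; all 16 loads can be in flight at once.
    fn find_work(&self) -> Option<usize> {
        for i in 0..NUM_MAILBOXES {
            if self.owned[i].load(Ordering::Acquire) == 1 {
                return Some(i);
            }
        }
        None
    }
}

fn main() {
    let bank = MailboxBank::new();
    assert_eq!(bank.find_work(), None);
    bank.owned[7].store(1, Ordering::Release);
    assert_eq!(bank.find_work(), Some(7));
}
```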

For maximum performance, it's pretty much always required to try out some un-ergonomic API. In this case the sequencing is nice, but perhaps I could rewrite the code that uses mempipe to not care, and handle sequencing someplace else? I forget. In theory I could have multiple banks of tickets and switch between them based on some level of pressure or a timeout. Using the same signal line between the writer and the reader (eg, they both must write to `client_owned`) is probably a mistake. I think it'd be better to use two indexes, one bumped when a writer writes and the other bumped when a reader is done. This would allow better "bursting" of data, in my opinion? Who knows; even a data structure as simple as this effectively requires building it and measuring to really determine how it performs. There are so many resources in use in the CPU, and it's a tough balance to keep in your head.
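The two-index idea is essentially a classic SPSC ring: the writer only ever bumps `head` and the reader only ever bumps `tail`, each on its own cache line, so one side can burst several messages before the other side's line is touched at all. A minimal single-threaded sketch of the index discipline (my own names; a real concurrent version would need more care):

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAP: usize = 8;

#[repr(align(64))]
struct Padded(AtomicUsize);

// SPSC ring: producer writes `head`, consumer writes `tail`. The indices
// grow monotonically; occupancy is head - tail, slot index is mod CAP.
struct Ring {
    head: Padded, // bumped by the producer on write
    tail: Padded, // bumped by the consumer when done
    slots: [UnsafeCell<u64>; CAP],
}

// Safe here only under the one-producer/one-consumer discipline.
unsafe impl Sync for Ring {}

impl Ring {
    fn new() -> Self {
        Self {
            head: Padded(AtomicUsize::new(0)),
            tail: Padded(AtomicUsize::new(0)),
            slots: std::array::from_fn(|_| UnsafeCell::new(0)),
        }
    }

    /// Producer side: returns false if the ring is full.
    fn push(&self, val: u64) -> bool {
        let head = self.head.0.load(Ordering::Relaxed);
        if head - self.tail.0.load(Ordering::Acquire) == CAP {
            return false; // full; only now did we even read the tail line
        }
        unsafe { *self.slots[head % CAP].get() = val };
        self.head.0.store(head + 1, Ordering::Release);
        true
    }

    /// Consumer side: returns None if the ring is empty.
    fn pop(&self) -> Option<u64> {
        let tail = self.tail.0.load(Ordering::Relaxed);
        if tail == self.head.0.load(Ordering::Acquire) {
            return None; // empty
        }
        let val = unsafe { *self.slots[tail % CAP].get() };
        self.tail.0.store(tail + 1, Ordering::Release);
        Some(val)
    }
}

fn main() {
    let ring = Ring::new();
    for i in 0..CAP as u64 {
        assert!(ring.push(i)); // the producer bursts CAP messages
    }
    assert!(!ring.push(99)); // full: writer must wait on the reader
    assert_eq!(ring.pop(), Some(0));
    assert!(ring.push(99)); // one slot freed by the reader's index bump
}
```

Note the producer only reads `tail` when it suspects it's full, and the consumer only reads `head` when it suspects it's empty, which is what keeps the bursting cheap.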



