If you’re already competent with systems languages and keen to _go on a wild deep dive_ into this sort of perf engineering, where’s a good place, or what are good resources, to get started?
It's very brief (so if you only have time to read one thing you won't waste a single minute: it's to the point) and (perhaps surprisingly, given its publication date) highly relevant: In fact, right after reading Section IV.C (AMD K5) you should be able to immediately jump to and understand the microarchitectural diagrams of, say, AMD Zen 2:
http://web.archive.org/web/20231213180041/https://en.wikichi...
...and any other contemporary CPU.
The Smith & Sohi paper is going to give you great intuition for what's essentially implemented by all modern CPUs: restricted dataflow (RDF) architecture.
For context, all superscalar dynamically scheduled (out-of-order) CPUs implement RDF, introduced by Yale Patt's research group's work on High Performance Substrate (HPS). This work pretty much defined the modern concept of a restricted dataflow CPU: breaking complex instructions into smaller micro-operations and dynamically scheduling these to execute out-of-order (or, to be more specific, in a restricted dataflow order) on multiple execution units.
(If you're curious, the "restricted" part comes from the finite instruction window delimiting the operations to be scheduled, which stems from the finite size of all the physical resources in real hardware, like the ROB/reorder buffer. The "dataflow" comes from having to respect data dependencies, like `A = 1` and `B = 2` having to execute before `C = A + B`, or `MOV A, 1` and `MOV B, 2` having to execute before `ADD C, A, B`; but note that you can freely reorder the aforementioned moves among themselves as long as you execute them before the add: a schedule/execution order of `MOV B, 2; MOV A, 1; ADD C, A, B` is just as valid).
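The scheduling freedom described above can be sketched in a few lines. This is a toy model (my own illustration, not from the paper): an instruction may issue only once every instruction it depends on has executed, and we enumerate which orderings of the two MOVs and the ADD satisfy that rule.

```python
from itertools import permutations

# Toy dataflow-scheduling model: an instruction may execute only after
# all instructions it depends on have executed.
instrs = ["MOV A, 1", "MOV B, 2", "ADD C, A, B"]
deps = {"ADD C, A, B": {"MOV A, 1", "MOV B, 2"}}  # the ADD needs both MOVs

def valid(order):
    done = set()
    for ins in order:
        if not deps.get(ins, set()) <= done:
            return False  # a dependency hasn't executed yet
        done.add(ins)
    return True

# Of the 6 possible orderings, exactly 2 respect the data dependencies:
# the two MOVs in either order, followed by the ADD.
schedules = [order for order in permutations(instrs) if valid(order)]
```

A real scheduler of course never enumerates permutations; it just wakes up operations as their source operands become ready, but the set of legal orders is the same.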
For the historical background, see "HPS papers: A retrospective", https://www.researchgate.net/publication/308371952_HPS_paper...
(Minor pet peeve warning:) It also sets the record straight w.r.t. RISC being orthogonal to the historical development of modern superscalar out-of-order CPUs: In particular, it's worth noting that the aforementioned micro-operations have absolutely _nothing_ to do with RISC! Another great resource on that is Fabian's video "CPU uArch: Microcode", https://www.youtube.com/watch?v=JpQ6QVgtyGE (also worth noting that micro-operations and microcode are _very_ different concepts; that's also very well covered by the video).
Another good, succinct description of the historical background is the 2024 IEEE CS Eckert-Mauchly Award (Wen-mei Hwu was a PhD student in the aforementioned Yale Patt's group): "Hwu was one of the original architects of the High-Performance Substrate (HPS) model that pioneered superscalar microarchitecture, introducing the concepts of dynamic scheduling, branch prediction, speculative execution, a post-decode cache, and in-order retirement." - https://www.acm.org/articles/bulletins/2024/june/eckert-mauc...
On a side note, the load-store architecture introduced by the CDC 6600 (designed by Seymour Cray in the 1960s) is sometimes mistaken for RISC (which came a decade-plus later, arguably introduced in the IBM 801 designed by John Cocke in the late 1970s/early 1980s), https://en.wikipedia.org/wiki/Load%E2%80%93store_architectur...
One could say load-store architecture does influence compiler backend implementation, after a fashion: think of instruction scheduling, where a complex operation combining, say, a LOAD with an arithmetic operation OP is broken down in the scheduling model into separate LOAD and OP operations/effects.
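To make that concrete, here's a minimal sketch of what such a scheduling model might look like; the instruction name and latencies are purely illustrative (not taken from any real CPU or compiler), but the shape mirrors how backends describe a memory-operand instruction as a load effect feeding an ALU effect:

```python
# Illustrative per-effect latencies (made up for this sketch).
LATENCY = {"LOAD": 4, "ADD": 1}

def effects(instr):
    """Expand an instruction into its scheduling-model micro-effects."""
    if instr == "ADD r1, [mem]":      # hypothetical load-op instruction
        return ["LOAD", "ADD"]        # the load feeds the add
    return [instr]

def critical_latency(instr):
    # The effects form a dependence chain, so their latencies sum.
    return sum(LATENCY[e] for e in effects(instr))
```

The point is just that the scheduler reasons about the load and the arithmetic as separate effects with their own latencies, even though the ISA encodes them as one instruction, which is exactly the load-store-style decomposition mentioned above.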
These are absolutely fantastic--I followed all of these a few years back and can vouch for their quality and for being up to date, or at least decades ahead of the general computer architecture textbooks: the lectures and readings cover contemporary work, including ISCA and MICRO papers that have only recently found their way into production CPUs (e.g., the hashed perceptron predictor that's one of the branch predictor units in AMD Zen 2, http://web.archive.org/web/20231213180041/https://en.wikichi...).
There are more, topic-specific texts that are very good, e.g., A Primer on Memory Consistency and Cache Coherence, Second Edition; Synthesis Lectures on Computer Architecture 15:1 (2020); Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, David A. Wood; open access (freely available) at https://doi.org/10.2200/S00962ED2V01Y201910CAC049. But by the time you're at this point, you're likely going to be good at finding the relevant resources yourself, so I'm going to leave it to you to explore further :-)