I don't have any good benchmark to link to off the top of my head, but I'll handwave in the general direction of Agner Fog's guide, which is accurate in my experience (and generally a great resource): https://www.agner.org/optimize/microarchitecture.pdf

A common point across the multiple "Bottlenecks in AMD Zen" sections is:

>The limiting fetch rate of up to 16 bytes per clock is a very likely bottleneck for CPU-intensive code with large loops

Although, admittedly, if your hot loops are small enough to fit in the uop cache, that largely mitigates the fetch/decode problem.
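
To put rough numbers on that 16 bytes/clock figure, here's a back-of-the-envelope sketch. The only input taken from the guide is the 16 bytes per clock; the average instruction lengths are illustrative values I picked (x86 instructions can be anywhere from 1 to 15 bytes), not measurements, and `fetch_bound_ipc` is just a name I made up for this. It ignores alignment, taken branches, and the uop cache entirely, so treat it as a ceiling from fetch alone rather than a prediction.

```python
# Fetch rate quoted from the guide above; everything else here is an assumption.
FETCH_BYTES_PER_CLOCK = 16


def fetch_bound_ipc(avg_instr_bytes, fetch_bytes_per_clock=FETCH_BYTES_PER_CLOCK):
    """Upper bound on instructions per clock if fetch were the only limit."""
    if avg_instr_bytes <= 0:
        raise ValueError("average instruction length must be positive")
    return fetch_bytes_per_clock / avg_instr_bytes


if __name__ == "__main__":
    # Illustrative average instruction lengths in bytes (not measured).
    for avg_len in (3, 4, 6, 8):
        bound = fetch_bound_ipc(avg_len)
        print(f"{avg_len} bytes/instr -> at most {bound:.2f} instr/clock from fetch")
```

The takeaway is just that the ceiling drops quickly as instructions get longer (long prefixes, large immediates or displacements), which is why this bites code that doesn't run out of the uop cache.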