I know it's not the article topic but I want to talk about mobile Zen4.
Mobile Zen4 is obviously powering the ASUS ROG Ally and other handheld Steam Deck clones, and I'm very excited about it, but it's almost impossible to get laptops with it installed! The ones I found also have a discrete GPU which defeats the point. Having a low power draw, thin and light laptop that can also run fairly recent AAA games - amazing!
In the end I got tired of waiting for a Thinkpad T14s with Zen4 and got the previous generation, which is about 20% slower but still pretty good.
On the other hand, while in previous years there has been a delay of around one year between a Ryzen model appearing in laptops and its appearance in small NUC-like computers, this year a large number of companies introduced small computers with AMD Phoenix (Ryzen 7?40) immediately after its launch.
I am happy about this, because instead of carrying a big laptop, I prefer to carry a NUC-like computer, a 17" portable monitor and a compact keyboard, which all together are both lighter and much cheaper than a mobile workstation laptop, while being faster and more comfortable for working and also having more peripheral interfaces.
> The ones I found also have a discrete GPU which defeats the point. Having a low power draw, thin and light laptop that can also run fairly recent AAA games - amazing!
So, disable it? Most Zen4 laptops I've seen so far still fit the general thin-and-light category (albeit at 15" vs 11-13") and are paired with a mid-range Radeon or GeForce (usually a 4060/4070). Buy the laptop and put it in Eco mode in Windows (which doesn't mux in the discrete GPU) or disable it in the BIOS.
The Framework comes with a Zen4 CPU. It's thin and light, and the GPU can be disabled.
As someone who owns an AMD laptop (with a discrete Radeon), I should point out that AMD's muxing algorithm is pretty conservative unless you put it into performance mode. Video playback, basic GL/VK/DX operations, desktop usage/web browsing, etc. stick to the integrated GPU. I've only seen the discrete GPU kick in once obviously heavy 3D operations were running (video games, benchmarks, 3D modeling, etc.).
FYI the framework 13 doesn't even have a dGPU, so it would fit it seems. Not available yet, though, AFAIK. I'm certainly waiting for mine (preordered a couple months back).
Which is specifically why I address thinness and weight independently of the disablement. You could read past the first sentence next time.
As to cost, AMD generally bundles their GPU/CPUs so you aren't saving much by removing it. Certainly not anything the manufacturer is going to pass on to you, at least.
It might be thin and light enough for you, but I'm guessing it could be thinner or lighter.
Can they really provide a whole dedicated GPU, the extra circuitry on the motherboard to support it, a whole other cooling system, more powerful power delivery and the battery size to compensate all without costing any extra money or weight?
You vastly overestimate the TDP of a midrange laptop GPU. And/or don't understand general cooling infrastructure in modern laptops.
As to the size, what is even the point of arguing. Just go look at the specs yourself and decide if it's thin and light enough. I gave an option that the majority would consider so. If it's not for you, fine. Find something else or accept it doesn't exist.
You can keep moving the goalposts infinitely for someone who needs a hypothetically thinner and lighter laptop, it doesn't matter in the grand scheme of things.
I looked up the TDP of some laptop GPUs and they're all over 35W. The maximum TDP of the CPU the commenter was talking about, which has good integrated graphics, is 30W. The TDP of the mobile 4070 you mentioned is 115W(!)
> If it's not for you, fine. Find something else or accept it doesn't exist.
You seemed to be very dismissive of the person you were replying to, implying that it's not an issue at all because you can just disable it. I'm pointing out that that isn't always the case. I never said that your solution is completely useless, it's just not a good solution for most people.
Also, I'm pretty sure the Framework doesn't have a built in dedicated GPU like you're saying, but an optional detachable module for one (that makes it more viable in this case though)
> I looked up the TDP of some laptop GPUs and they're all over 35W. The maximum TDP of the CPU the commenter was talking about, which has good integrated graphics, is 30W. The TDP of the mobile 4070 you mentioned is 115W(!)
None of which is contained inside of the Framework laptop. Awesome, glad you found some random stats to dismiss a hypothetical laptop I mentioned.
> You seemed to be very dismissive of the person you were replying to
I offered an alternative to their complaint. But sure, instead of "dismissing" them, I could leave them to complain and whine. Or follow you down ever dwindling definitions of "thin" and "light" until you prove yourself correct about some made up argument in your head.
Either way, I'm done with your Trump-style snippet-based retort tactics. Go buy a Steam Deck clone, since that's yours and the OP's only choice, apparently.
If you mean the Framework, a) it's not out yet, and b) it's not 850 USD in my country (try double that, after adding RAM/disk/modules). But it is a very nice laptop for sure.
The others are things like the Razer Blade and ASUS Zephyrus.
> AMD should be commended for not wasting area and power chasing bragging rights. [...] Matching Golden Cove’s 6-wide decoder would have been great for bragging rights, but won’t affect performance much.
Except that isn't true either, outside the very narrow case of "front end for the sake of front end":
Let's say you have two otherwise identical use cases: Intel's P and E cores, and the equivalent over at AMD, Zen and Zen-c.
The respective E cores are approximately 50-60% of the size of their matching-generation P cores, yet use about 40-50% of the power to get roughly the same performance (i.e., 2x E cores vs a single P core with both threads running are sometimes comparable to each other, especially with all the anti-Meltdown/Spectre protections).
So, if you compare a P core with two threads (which is 100% of the size of a P core, duh) against two E cores, each with a single thread (which is 100% to 120% of the size of that single P core), you get near-equivalent performance, often at lower power usage.
If I consider the rough equivalence of a P core vs 2x E cores, the front end has widened (indeed, 2x E cores have more front end than 1x P core), while the backend has not significantly widened (E cores from both companies are tiny, and only some of that can be attributed to less cache area).
I suspect AMD's end game is to eventually end dual-thread cores and move to the model being done in Zen-c. One, it removes, forever, an entire class of possible security bugs (after all, Intel's knee-jerk reaction was to reissue the 8th-gen series as the 9th, with threads disabled and all cache assigned to the first/only thread). Two, it would make for extremely high performance desktop and server chips that are better matched to modern workloads. If AMD continues to make threaded cores, they will be 4/8-thread cores sold on specialty Epycs only (i.e., the reverse of the current lineup, where the specialty Epycs are Zen4c and there is no desktop Zen4c).
"But, hey, what about single-threaded-only workloads?" you might ask.
Workloads like that are generally memory latency bound first, memory bandwidth bound second, and then IPC bound third: AMD solved that by having an absolutely gigantic L3. Games are the poster-child of poorly optimized software, and AMD's giant L3 cache does magic here.
If my next chip were a 16c/16t E-core desktop chip (to replace the common 8c/16t, the kind I recommend to desktop users and use myself), I'd be happy.
>> I suspect AMD's end game is to eventually end dual-thread cores
I think not. They're going to widen the fetch/decode in Zen 5. That will help refill the pipe after all those mispredicts and cache misses, but may not help the latency of refilling. One of the best ways to handle stalls is to let "the other thread" execute. I suspect Zen 5 will get huge gains on branchy multi-threaded code but more modest gains on single thread. They're after performance per watt and per dollar, while Intel keeps aiming for top single-thread performance and market segmentation - two things some of us don't care about.
That said, I'm sure there is a market for a 4 or 8 core chip without SMT that has 50 percent higher single thread performance, and I won't be surprised if Intel provides such a chip.
Can someone explain to me why, if a branch is mispredicted often enough, a CPU can't execute both the true and the false side? Then throw away half the work when the direction of the branch is finally known. The cost would be more unretired instructions, but the cost of prediction failure would also be lower.
You need to double the hardware... which, fair enough, modern CPUs are superscalar so they already do that. The issue is that there are a lot of branches in the code, so the CPU would need to keep a lot of redundant hardware around. All that extra hardware comes with increased power usage and consequently heat dissipation. On top of that, modern branch predictors are pretty amazing, so you would need to get a lot of benefit to make this worth it...
So the trade is: you can get slightly better latency, thanks to misprediction masking, by executing both branches, at the cost of massively decreased throughput (because you are using the extra hardware to execute both sides of a branch in the same thread instead of work from different threads), increased power and heat dissipation, and increased cost due to additional hardware. Note that cost, power, and heat are generally the constraints you want to satisfy, so you will generally get a significantly slower CPU that probably has worse latency due to lower clocks, less cache, and whatever other tradeoffs you need to make to fit into the power/heat/cost budget.
Under your proposed scheme, in the window between a branch being issued and resolved, around half the subsequent instructions executed by your CPU will be retired and the other half will be discarded. So, all else being equal, performance is 50% compared to non-branching code.
With branch prediction, after a single branch all the speculatively-executed instructions are either retired or discarded. If a branch predictor randomly guesses for each branch, it will achieve 50% accuracy...so, averaged across many branches, 50% of speculatively executed instructions are retired.
So the worst possible branch predictor has equivalent performance to your proposed scheme, and it’s quite a bit simpler to implement (which translates to increased performance). If your branch predictor is any better than blind guessing, it will achieve better performance (by the benchmark of % of correctly-speculated instructions). Given that modern branch predictors achieve 95%+ accuracy on typical workloads, it’s clear why processor designers choose branch prediction rather than executing both pathways.
I'm not sure this looks at the correct resources. If I read TFA correctly, they have an IPC of 0.85 on the benchmark, but a theoretical IPC limit of 6. So 50% usage would be IPC = 3, a lot better than the benchmark. So execution resources aren't the bottleneck.
Also, this analysis skips the easily predictable jumps like loops. I remember 90% prediction being relatively easy to reach in most programs just because of loops. A better way would be to let the predictor respond not only with taken/not taken, but also with high/low confidence. If confidence is low, then the CPU could spend some of its unused IPC executing both sides.
All of this probably won't work in practice, of course.
The bigger thing you're missing is that theoretical IPC of 6 can only happen if you are using a bunch of different execution ports. Most branches will have similar instruction workloads afterwards so the different sides of the branch will be fighting each other.
Some CPUs have attempted to do this, but the cost is enormous.
Conditional branches are very frequent, so a modern CPU that has much more than one hundred instructions in flight might have ten or twenty branches past which it executes instructions speculatively.
So it will not have two sides executed in parallel, but each side will be split in two at the next branch, then again at the next branch. The number of branches that must be executed speculatively in parallel grows exponentially, together with the ratio between executed instructions that must be discarded and useful instructions whose results are retained after the conditions of the branches are resolved.
So such a CPU would have a huge power consumption due to doing mostly useless work, which would reduce its performance much more than what is lost during branch mispredictions when only one side is executed.
For unpredictable branches, all modern CPUs offer conditional move or conditional select instructions, where both sides of a conditional expression are executed and then the result for the true condition is selected.
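For example, a minimal C++ sketch of that idea (hypothetical clamp function; whether either form actually compiles to a cmov/csel is up to the compiler, so check the assembly):

```cpp
#include <cstdint>

// Branchy version: the comparison becomes a conditional jump the CPU
// has to predict.
int64_t clamp_branchy(int64_t x, int64_t limit) {
    if (x > limit)
        return limit;
    return x;
}

// Branchless version: both candidate values are available and one is
// selected, which compilers typically lower to a conditional move, so
// an unpredictable comparison never turns into a pipeline flush.
int64_t clamp_branchless(int64_t x, int64_t limit) {
    return (x > limit) ? limit : x;
}
```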
It’s definitely a thought I’ve had from time to time as well. I’m not a chip designer but my hunch is that it’s because one of the branches is likely to result in even more branches, likely very soon after that initial branch on one of the paths taken (ie you won’t get far before hitting another branch). Example is a condition that causes you to leave the loop - on every iteration of the loop you’d be trying to speculatively execute what happens after the loop.
That means you've got an exponentially increasing amount of speculative execution happening and it's almost all worthless (i.e. only ~1/2^n of it is useful after n branches) while competing for resources with the one real path. This is ignoring the practical challenges of managing all that speculation and any fun additional speculative-execution attacks that might arise again.
The best case for something like this is where you have a bunch of unpredictable branches. In those cases, it’s better for the compiler to convert it to “branchless” code (eg using the x86 cmov instruction).
There are also major performance problems unrelated to branches, namely data dependencies. I think those are even more impactful these days, as they prevent a deep pipeline from being kept full. Those can only really be solved by fixing code.
Ideally try to design your loops so you're accessing memory sequentially (better cache locality) and predictably (i.e. don't branch on a condition that changes in every iteration). Using a branchless algorithm is only really beneficial if the branch is constantly being mispredicted.
As always, profile first so you don't fall into the premature optimization hole.
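To make the memory-access point concrete, a toy C++ sketch (hypothetical flat row-major matrix; the only difference between the two functions is loop order):

```cpp
#include <cstddef>
#include <vector>

// rows x cols matrix stored row-major in one flat vector.

// Inner loop walks memory contiguously: each cache line is fully used
// and the hardware prefetcher can keep up.
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same work, but the inner loop strides by `cols` elements, touching a
// different cache line almost every iteration.
double sum_column_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```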
You should run your critical loops through perf or a similar tool and get statistics for branch misprediction. If you have a lot of misprediction, you should think about doing branchless stuff. If you don't have a lot of misprediction, the branchy code will be faster.
Sometimes you can guess whether the data the loop will hit will be predictable or not. The same applies; if you expect it to be predictable, then don't try to remove branches. If you expect it to be unpredictable, try to remove branches.
Note that you will need to teach yourself your expectations; the branch predictor might be better or worse than you expect it to be. Modern branch predictors are extremely complex, often taking more die space than everything except cache. Don't assume that just because you can't see a pattern in the data, the branch predictor won't be able to.
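A toy example of the kind of branch this is about (hypothetical byte-threshold count; whether the branchy version predicts well depends entirely on the data, which is exactly what perf's misprediction counters will tell you):

```cpp
#include <cstddef>
#include <cstdint>

// Branchy: nearly free on sorted or mostly-uniform data, but mispredicts
// roughly half the time on random data.
std::size_t count_large_branchy(const std::uint8_t* data, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        if (data[i] >= 128)
            ++count;
    return count;
}

// Branchless: the comparison result is added directly, so there is
// nothing to mispredict. Usually only worth it when the branchy version
// shows heavy misprediction.
std::size_t count_large_branchless(const std::uint8_t* data, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i)
        count += static_cast<std::size_t>(data[i] >= 128);
    return count;
}
```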
This isn't necessarily game-specific. I'm not up to speed with current gen, but in the past the simple answer was "it depends", as usual. :-)
There are times when doing some bit-twiddling hacks outperforms branching, or where (partial) loop unrolling is faster than the higher code density of a tight loop, but in the end, for trivial cases, the compiler often, but not always, will do these things behind the scenes if you tell it what CPU you want to target. And if you really want the best performance in a particularly hot section, you just have to benchmark every possible implementation and pick on a case-by-case basis, or even provide two or three different implementations and pick the best one at runtime.
(author here) I think it's not useful to eliminate branches in hot loops. These games have giant instruction footprints and a high branch rate. A loop will probably fit within the L1 BTB and uop cache, and probably won't benefit from eliminating branches.
Exception would be a branch that's near impossible to predict, like one that depends on a randomly generated value or otherwise doesn't correlate well with global history.
Branches are still fine in many cases, but it can be valuable, situationally, to replace them with things like cmovs instead. You'll need to benchmark to be sure - it's not like the old days where branches were always worse.
Branch hoisting, on the other hand, is a win like 90% or more of the time (I can imagine it being a loss if it causes stuff to fall out of icache), which is why most compilers will try to do it automatically.
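Roughly what I mean by hoisting, as a sketch (hypothetical flag; this is the same transform compilers call loop unswitching):

```cpp
#include <vector>

// Before: the loop-invariant mode check sits inside the loop and is
// evaluated on every iteration.
void process(std::vector<float>& data, bool use_fast_path) {
    for (float& v : data) {
        if (use_fast_path)
            v *= 2.0f;
        else
            v = v * 2.0f + 1.0f;
    }
}

// After hoisting: one check, two simple loops. Each body is smaller and
// easier to vectorize; the cost is extra code size, which is how this
// can backfire if it pushes hot code out of the icache.
void process_hoisted(std::vector<float>& data, bool use_fast_path) {
    if (use_fast_path) {
        for (float& v : data) v *= 2.0f;
    } else {
        for (float& v : data) v = v * 2.0f + 1.0f;
    }
}
```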
Worse than branches (generally) are indirect calls (i.e. virtual methods, function pointers) that don't predict well, so you especially want to avoid those in tight loops and if you can't avoid them you want to sort your data so that the call target doesn't change frequently.
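A sketch of the sorting idea (hypothetical Entity hierarchy; the point is just that grouping objects by concrete type keeps the indirect-branch target stable for long runs, and in practice you'd sort when the collection changes rather than every frame):

```cpp
#include <algorithm>
#include <memory>
#include <typeindex>
#include <typeinfo>
#include <vector>

struct Entity {
    virtual ~Entity() = default;
    virtual void update() = 0;   // the indirect call in the hot loop
};

void update_all(std::vector<std::unique_ptr<Entity>>& entities) {
    // Group by concrete type so the virtual call target stays the same
    // for long stretches instead of bouncing around unpredictably.
    std::sort(entities.begin(), entities.end(),
              [](const std::unique_ptr<Entity>& a, const std::unique_ptr<Entity>& b) {
                  return std::type_index(typeid(*a)) < std::type_index(typeid(*b));
              });

    for (auto& e : entities)
        e->update();
}
```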
TLDR: worry about branches in your hot loop kinda last and only if the profiler indicates you have a random walk through your branches AND you’ve verified the compiler emitted branches. There are more impactful optimization techniques to worry about first and often with PGO and LTO the compiler is going to be a lot better about making relevant code branchless on your behalf without you having to think about it.
I’d say that’s something you should only do once you understand your architectural bottlenecks. These are bottlenecks that appear because of how you’re doing computation and moving data around. They won’t show up on any profiler because the profiler is telling you “this is where an implementation is slowest” and not “there’s a better implementation altogether” - the latter is art as it requires a good working understanding of computer architecture and playing around with algorithms that aren’t taught in text books (or adapting them more optimally for your problem domain).
If you've done all other major architectural optimizations and you've identified a hot loop where the compiler is inserting branch code which is mispredicted, an explicitly branchless version will help. But remember - compilers are often very good at recognizing the opportunity for a branchless version and changing your code to that, so if you're in AOT land with a major compiler (clang/llvm, gcc, msvc, Intel) there's a good chance your algorithm might already be branchless. In fact, if you use something like PGO on a realistic workload, the compiler is even more likely to emit the correct machine code in way more places without you having to manually alter code by hand.
More impactful low level optimizations often are:
Optimizing for cache locality through something like entity component systems (ECS). Basically using a struct of arrays and no polymorphism instead of an array of polymorphic structs. This is a popular one for games and works really well. Similarly, make sure that data being accessed in a hot path is all linearly laid out in processing order if possible.
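Roughly what that looks like in code, as a toy sketch (not any particular ECS library):

```cpp
#include <cstddef>
#include <vector>

// Array of structs: updating positions drags velocity, health, and every
// other field through the cache even though only a few are touched.
struct EntityAoS {
    float x, y;
    float vx, vy;
    int   health;
    // ... more rarely-used fields
};

void update_aos(std::vector<EntityAoS>& entities, float dt) {
    for (auto& e : entities) {
        e.x += e.vx * dt;
        e.y += e.vy * dt;
    }
}

// Struct of arrays: the hot loop streams through exactly the data it
// needs, laid out contiguously, which is also the shape autovectorizers
// like to see.
struct EntitiesSoA {
    std::vector<float> x, y, vx, vy;
    std::vector<int>   health;
};

void update_soa(EntitiesSoA& e, float dt) {
    for (std::size_t i = 0; i < e.x.size(); ++i) {
        e.x[i] += e.vx[i] * dt;
        e.y[i] += e.vy[i] * dt;
    }
}
```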
Applying SIMD to your hot loop processing. The compiler does have autovectorization that can work well with ECS, but it requires you to write your scalar code carefully to trigger it (i.e. your scalar code has to look similar to the vector version in important ways). Check the assembly to see if the compiler is doing it and rewrite using compiler intrinsics when it's not. Similarly, try to minimize data dependencies between your loop iterations. E.g. instead of
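a single running total, where every addition has to wait for the previous one (a sketch):

```cpp
#include <cstddef>

float sum_serial(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];   // each add depends on the previous add's result
    return sum;
}
```

keep several independent partial sums (again a sketch; 8 accumulators is an arbitrary choice):

```cpp
#include <cstddef>

float sum_unrolled(const float* data, std::size_t n) {
    float partial[8] = {};               // 8 independent dependency chains
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)
        for (std::size_t k = 0; k < 8; ++k)
            partial[k] += data[i + k];   // none of these adds waits on another
    for (; i < n; ++i)                   // remainder
        partial[0] += data[i];
    float sum = 0.0f;
    for (float p : partial)
        sum += p;
    return sum;
}
```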
That’s very very similar to what a SIMD loop would look like and a compiler is likely to recognize it. Even if it doesn’t autovectorize, you’re now doing 8 summations “concurrently” because there’s no data dependency for those 8 additions on every loop iteration, so the CPU will execute them without waiting for the result of any other summation. Make it unrolled enough to exhaust the execution units and you’ve guaranteed that the beginning of the loop will start after the previous data dependency has been resolved (ie the CPU will be executing that loop at maximum speed). Again, this kind of stuff is silly to worry about until you figure out where your hot loop is for a given implementation.
Knuth writes "premature optimization is the root of all evil". Implementation optimizations can get you very small improvements once you've exhausted the low-hanging fruit (basically implementation mistakes). Switching implementations can often net you order-of-magnitude improvements.
Looking into branchless is probably often premature. Organizing your data efficiently and using good algorithms is not. Data dependencies in loops are probably in the middle there - they optimize your current implementation, but the improvement for that hot loop can be drastically more significant than a branchless design (fixing data dependencies can easily 10x your performance, while making a mispredicted loop branchless can net you maybe 20% or so for that hot loop).
Note that GCC will vectorize the first version if you add `-ftree-vectorize` to your optimization arguments. In order to vectorize the same loop with floats, (particularly relevant for gamedev) you need to add `-ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math`.
For gamedev, where numerical correctness is often secondary to speed, I would expect to use `-O3 -ffast-math`. In games where deterministic simulation is important, (eg, multiplayer) you may need to use only `-O3` in the physics simulation and only use `-ffast-math` in the graphics portion.
Anyway, to your real question, an entity component system-based middleware, hopefully in a language that isn't C++, forces you to author your code in a way where branches can be "moved" to remove the problematic memory access and execution pattern. So maybe it's not about specific C++ method branches, which may be beneficial to your employer but never beneficial to you or your career, but it's about the architecture of the thing that still has innovation opportunities that could really matter.
You probably aren't interested in the even bigger picture conversation, which is that some games are so ill specified that it is impracticable to use an ECS middleware. Games that are really well specified (i.e. clones & sequels) might thrive on ECS, but the ones we end up playing are authored by companies with real budgets and take years to develop, so the compute platform will get faster than your micro-optimization.