ARMv9.0 is very similar to ARMv8.5 (9.0 supersets 8.5 with SVE2, TME, TLA, and CCA), so it's not a massive deal. SME implies v8.7 which is basically identical to v9.2 except for those couple extensions previously mentioned.
I wonder if there is licensing at play though. Apple may have gotten a really great licensing deal on ARMv8 that they wouldn't be offered for ARMv9.
From what I’ve read previously, Apple has a special licensing deal already as they were part of founding Arm, although I don’t know if there are any details on exactly how that works.
I believe it's an architectural license, which lets them design their own cores based on the ARM instruction set. I think a few other companies may have this license, but it's not disclosed.
Apple cofounded ARM for use in the Newton product line; they released new Newton products from 1993-97 and discontinued them in 1998. They then used ARM again for the iPod, released in 2001.
They didn't design the iPod ARM SoC though, nor the ones in iPhones for quite some time, and the microcontrollers in Macs for power management and such were not ARM. (I mean, some of them might've been, but the one I'm thinking of was SH or something.)
> Apple didn't use ARM for something like a decade or two after that.
They used the ARM610 in the Newton in 1993 (ARM was founded in late 1990), then there was an 8-year gap to the iPod in 2001 (ARM7TDMI cores, which are ARM designs). Their first in-house ARM design (I believe) was the iPhone 4 in 2010.
They definitely didn't "architect/design ARM" for nearly a couple of decades after founding ARM, yeah, but they did use them.
Yes. Watching the misinformation unfold, spread itself, and feed back into its own loop is both interesting and tiring. It took months and some hard work to tamp down the idea that Unified Memory is SRAM and something special. But this ARM deal? 5 years and counting.
There are two points:
Apple has an architectural license, which somehow became a "special deal" as if they are the only one doing it. They are not; even Ampere Computing has one.
Apple was one of the founders of ARM, and somehow this gave people the impression that they have a "special deal".
And had hajile not stepped up and pointed out that ARMv9 is a superset of ARMv8, etc., I would have had to spend some time looking up the details of that superset just to stamp out this type of nonsense. (Each ARMv8+ and v9 release has way too many features and optional extensions; I don't even remember which is which.)
And this is not the first time YouTuber Vadim Yuryev gave something out that is completely wrong.
Does anyone have insight into why Arm CPU vendors seem so hesitant about implementing SVE2? ~They seem~ Apple seems to have no issue with SSVE2 or SME.
Edit: Only Apple has implemented SSVE and SME I think.
What is the measurable benefit to implementing 128b SVE2? Like, ARM has CPUs that implement that, and it's not even disabled on some chips. So there must be benchmarks somewhere showing how worthwhile it is.
And implementing 256b SVE has different issues depending on how you do it. 4x256b vector ALUs are more power hungry than generally useful. 2x256b is only beneficial over 4x128b if you're limited by decode width, which isn't an issue now that A32/T32 support has been dropped. 3x256b would probably imply 3x128b which would regress existing NEON code. And little cores don't really want to double the transistors spent on vector code, but you can't have a different vector length than the big cores...
I'd say that the theoretical ability to gang units together would be appealing.
If you have four 128-bit packed SIMD, you must execute 4 different instructions at once or the others go to waste. With SVE, you could (in theory) use all 4 as a single, very wide vector for common operations if there weren't a lot of instructions competing for execution ports. You could even dynamically allocate them based on expected vector size or amount of vector instructions coming down the pipeline.
Additionally, adding two 2048-bit vectors using NEON (128-bit packed SIMD) would require 16 add instructions while SVE would require just one. That's a massive code size reduction which matters for I-cache and the frontend throughput.
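To make that concrete, here is a minimal vector-length-agnostic sketch (assuming a compiler with `<arm_sve.h>` targeting SVE; keep in mind the hardware, not the programmer, picks the actual vector length, so the "16 adds vs. one" ratio only holds on a 2048-bit implementation):

```cpp
// Sketch: the same loop works for any hardware vector length from 128 to 2048
// bits, while a NEON version is pinned to fixed 128-bit (4 x float) chunks.
#include <arm_sve.h>
#include <cstddef>

void AddF32(float* dst, const float* a, const float* b, size_t n) {
  for (size_t i = 0; i < n; i += svcntw()) {    // svcntw() = 32-bit lanes per vector
    svbool_t pg = svwhilelt_b32_u64(i, n);      // predicate also handles the tail
    svfloat32_t va = svld1_f32(pg, a + i);
    svfloat32_t vb = svld1_f32(pg, b + i);
    svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
  }
}
```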
I don't see how this would work out beneficially. Let's say your hardware can join 4x128b units as a virtual 512-bit SVE SIMD unit. This means you have to advertise VL as 512bit for reasons of consistency. Yes, you will save some entries in the reorder buffer if you encounter a single SVE instruction, but if the code contains independent SVE streams, you will be stalled. Moreover, not all operations will utilize all 512 register bits, so your occupancy might suffer. The only scenario I see this feature working out is if you are decode or reorder buffer limited. Neither is a problem for modern high-performance ARM cores. With x86, it might be a different story. From what I understand, AVX512 instructions can be quite large.
Modern out-of-order cores are already good at superscalar execution, so why not let them do their job? 4x128b units give you much more flexibility and better execution granularity.
On x86 at least, the cost of OoO is astonishing - more pJ per instruction dispatch than the operation itself. Amortizing that over more operations is the whole point of SIMD. I have not yet seen such data for Arm.
That aside, see the "cmp" sibling thread for a major (4x penalty) downside to 4x128.
Yes, OoO is expensive — after all, that is the cost of performance. Very wide SIMD is great for energy efficiency if that is what your compute patterns require (there is a good reason why GPUs are in-order very wide SMT SIMD processors). Is this the best choice for a general-purpose CPU? That I am not so sure about. A CPU needs to be able to run all kinds of code. A single wide SIMD unit is great for some problems, but it won't deliver good performance if you need more flexibility.
Could you point me to the "cmp" thread you mentioned? I don't know where to look for it.
> I agree with you we do not only want "very wide SIMD", and it seems to me that 2x512-bit (Intel) or 4x256 (AMD) are actually a good middle ground.
I'd already classify this as "very wide". And the story is far from being that simple. Intel's 512-bit implementation is very area- and power-hungry, so much so that Intel is dropping the 512-bit SIMD altogether. AMD has 4x add units, but only two are capable of multiplication. So if your code mostly does FP addition, you get good performance. If your workflows are more complex, not so much.
The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation. And that on a core that runs at a lower clock and has less L1D bandwidth. Flexibility and symmetric ALU capabilities seem to be king here.
Ah, that is what you meant. Thank you for linking the post! My comment would be that this is not about 128b or 256b SIMD per se but about implementation details.
There is nothing stopping ARM from designing a core with more mask write ports. Apparently, they felt this was not worth the cost. Other vendors might feel differently. I'd say this is similar to AMD shipping only two FMA units instead of four. Other vendors might feel differently.
For very wide, I'm thinking of Semidynamics' 2048-bit HW, which with LMUL=8 gives 2048-byte vectors, or the NEC vector machines.
AFAIK it has not been publicly disclosed why Intel did not get AVX-512 into their e-cores, and I heard surprise and anger over this decision. AMD's version of them (Zen4c) are a proof that it is achievable.
I am personally happy with the performance of AMD Genoa e.g. for Gemma.cpp; f32 multipliers are not a bottleneck.
> The thing is that on many real-world SIMD workloads, Apple's 4x128bit either matches or outperforms either Intel's or AMD's implementation
Perhaps, though on VQSort it was more like 50% the performance. And if so, it's more likely due to the astonishingly anemic memory BW on current x86 servers. Bolting on more cores for ever more imbalanced systems does not sound like progress to me, except for poorly optimized, branch-heavy code.
> Perhaps, though on VQSort it was more like 50% the performance.
I looked at the paper and my interpretation is that the performance delta between M1 (Neon) and the Xeon (AVX2) can be fully explained by the difference in clock (3.7 vs 3.3 GHz) and the difference in L1D bandwidth (48 bytes/cycle vs. 128 bytes/cycle). I don't see any evidence here that narrow SIMD is less efficient.
The AVX-512 version is much faster, but that is because it has hardware features (most importantly, compact) that are central to the algorithm. On AVX2 and Neon these are emulated with slower sequences.
Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it to both the left and right sides, as opposed to compressing twice and writing those individually.
Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?
> Note that compact/compress are not actually the key enablers: also with AVX-512 we use table lookups for u64 keys, because this allows us to actually partition a vector and write it to both the left and right sides, as opposed to compressing twice and writing those individually.
That is interesting! So do I understand you correctly that the 512b vectors allow you to implement the algorithm more efficiently? That would indeed be a nice argument for longer SIMD
> Isn't the L1d bandwidth tied to the SIMD width, i.e., it would be unachievable on Skylake if also only using 128-bit vectors there?
It's a hardware detail. Intel does tie it to SIMD width, but it doesn't have to be the case. For example, Apple has 4x128b units but can only load up to 48 bytes (I am not sure about the granularity of the loads) per cycle.
Right, longer vectors let us write more elements at a time.
I agree that the number of L1 load ports (or issue width) is also a parameter: that times the SIMD width gives us the bandwidth. It will be interesting to see what AMD Zen5 brings to the table here.
If you do streaming-type operations on long arrays, yes. If your data sizes are small, however, four smaller units might be more flexible. As a naive example, let's take the popular SIMD acceleration of hash tables. Since the key is likely to be found close to its optimal location, long SIMD will waste compute. With small SIMD however you could do multiple lookups in parallel courtesy of OoO.
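As a rough sketch of that hash-table case (a hypothetical Swiss-table-style layout, not any particular library): one 128-bit NEON compare covers a whole 16-slot control group, and independent probes for different keys can overlap in an out-of-order core.

```cpp
// Sketch: 16 one-byte control tags per group; compare all of them at once.
#include <arm_neon.h>
#include <cstdint>

// Returns a 64-bit value with a nonzero nibble for every slot whose tag matches.
inline uint64_t GroupMatch(const uint8_t* group, uint8_t tag) {
  uint8x16_t eq = vceqq_u8(vld1q_u8(group), vdupq_n_u8(tag));   // 0xFF where equal
  // Classic NEON "movemask" substitute: shift-narrow to get 4 bits per lane.
  uint8x8_t nibbles = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
  return vget_lane_u64(vreinterpret_u64_u8(nibbles), 0);
}
```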
This is why I like the ARM/Apple design with "regular SIMD" and "streaming SIMD". The regular SIMD is latency-optimized and offers versatile functionality for more flexible data swizzling, while the streaming SIMD uses long vectors and is optimized for throughput.
You can't do 2048 bits of addition in one SVE instruction; not portably, at least (and definitely not on any existing hardware). While the maximum SVE register size is 2048 bits, the minimum is 128 bits, and the hardware chooses the supported register size, not the programmer. For portable SVE, your code needs to work for all of those widths, not just the smallest or largest. (of related note is RISC-V RVV, which allows you to group up to 8 registers together, allowing a minimum portable operation width of 128×8 = 1024 bits in a single instruction (and up to 65536×8 = 64KB for hypothetical crazy hardware with max VLEN), but SVE/SVE2 don't have any equivalent)
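For what the RVV grouping looks like in practice, a rough sketch (using the v1.0 intrinsics spelling with the `__riscv_` prefix; older toolchains omit it): with LMUL=8, every load/add/store below operates on a group of 8 registers, i.e. at least 8x128 = 1024 bits per instruction on a VLEN=128 part.

```cpp
// Sketch, assuming a toolchain that ships <riscv_vector.h> for RVV 1.0.
#include <riscv_vector.h>
#include <cstddef>

void AddF32(float* dst, const float* a, const float* b, size_t n) {
  for (size_t i = 0; i < n;) {
    size_t vl = __riscv_vsetvl_e32m8(n - i);     // lanes this iteration, LMUL=8
    vfloat32m8_t va = __riscv_vle32_v_f32m8(a + i, vl);
    vfloat32m8_t vb = __riscv_vle32_v_f32m8(b + i, vl);
    __riscv_vse32_v_f32m8(dst + i, __riscv_vfadd_vv_f32m8(va, vb, vl), vl);
    i += vl;
  }
}
```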
A for() loop does the same thing at the cost of like 3 instructions. 4x128b has the flexibility that you don't need 512b wide operations on the same data to keep the ALUs fed. If you have 512b wide operations being split to 4x128b instructions, great, otherwise the massive OoOE window of modern chips can decode the next few loop iterations to keep the ALUs fed, or even pull instructions from a completely different kernel.
> What is the measurable benefit to implementing 128b SVE2
Probably not much; SVE2 has some nicer instructions, but NEON is already quite solid.
> And implementing 256b SVE has different issues depending on how you do it
For in-order and not-very-aggressively out-of-order cores, having a larger vector length can be very useful to still get a lot of throughput out of your design. It also helps hide memory latency.
For aggressively out-of-order cores it should, for the most part, just be about decode, and somewhat about hiding memory latency.
> 2x256b is only beneficial over 4x128b if you're limited by decode width [...] 3x256b would probably imply 3x128b which would regress existing NEON code.
I agree, that's why I don't get why people are "excited" for Zen5 to have 512b execution units, instead of 256b ones. At best there won't be a performance improvement for avx/avx2 code, at worst a regression.
Anyone interested in getting such numbers could run github.com/google/gemma.cpp on Arm hardware with hwy::DisableTargets(HWY_ALL_NEON) or HWY_ALL_SVE to compare the two :) I'd be curious to see the result.
Calling hwy::DispatchedTarget indicates which target is actually being used.
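In sketch form (based on the calls named above; exact headers and macro names may differ slightly between Highway versions):

```cpp
// Sketch: mask out the NEON targets, run the workload, then report what ran.
#include <cstdio>
#include "hwy/highway.h"
#include "hwy/targets.h"

int main() {
  hwy::DisableTargets(HWY_ALL_NEON);   // or HWY_ALL_SVE for the opposite experiment
  // ... run the gemma.cpp workload here ...
  std::printf("dispatched: %s\n", hwy::TargetName(hwy::DispatchedTarget()));
  return 0;
}
```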
What is the percentage gain of using masked instructions on any benchmark/task of your choice? It can be negative on weird kernels that do lots of vector cmp since even ARM decided the cost of more than one write port in the predicate register file wasn't worth it, or if the masking adds lots of unnecessary and possibly false dependencies on the destination registers.
> This is only true if we ignore more complex instructions and focus on things like adding two vectors.
ARM implemented a CPU that had 2x256b SVE and 4x128b NEON. Literally the only benchmarks that benefitted from SVE were because they were limited by the 5-wide decode in NEON.
It's great you bring up cmp, helps to understand why 4x128 is not necessarily as good as 1x512. Quicksort, hardly a 'weird kernel', does comparisons followed by compaction. Because comparisons return a predicate, which have only a single write port, we can only do 128 bits of comparisons per cycle. Ouch.
However, masking can still help our VQSort [1], for example when writing the rightmost partition right to left without stomping on subsequent elements, or in a sorting network, only updating every second element.
I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.
I think the transition from AVX2 to AVX512 is comparable in that it provided not only larger vectors, but also a much nicer ISA. There were certainly a few projects that benefited significantly from that move. simdjson is probably the most famous example [0].
>I think it's somewhat unfair to ask for real world examples when there really aren't many people writing optimized SVE code right now. Probably because there are hardly any devices with the extension.
Ironically, on the RISC-V side, RVV 1.0 hardware is readily available and cheap. BananaPI BPI-F3 (spacemiT K1) is RVA22+RVV, as well as some C908-based MCUs.
CPUs with SVE have been generally available for two years now. SME and AVX-512 got benchmarks written showing them off before the CPUs were even available. Seems fair to me.
simdjson specifically benefitted from Intel's hardware decision to implement a 512b permute from 2x 512b registers with a throughput of 1/cycle. That's area-expensive, which is (probably) why ARM has historically skimped on tbl performance, only changing as of the Cortex-X4.
Anyway simdjson is an argument for 256b/512b vector permute, not 128b SVE.
Having written a lot of NEON and investigated SVE... I disagree that SVE is a nicer ISA. The set of what's 2-operand destructive, what instructions have maskable forms vs. needing movprfx that's only fused on A64FX, and dealing with the intrinsics issues that come from sizeless types are all unneeded headaches. Plus I prefer NEON's variable shift to SVE's variable shifts.
Fair point about movprfx, I understand they were short on encoding space. This can be mitigated by using *_x versions of intrinsics where masks are not used.
The sizeless headache is anyway there if you want to support RISC-V V, which we do.
One other data point in favor of SVE: its backend in Highway is only 6KLOC vs NEON's 10K, with a similar ratio of #if (indicating less fragmentation, more orthogonal).
It’s been a while since I looked, but I remember SVE2 being much more usable than SVE. A64FX was SVE IIRC. I think SVE did not do a great job of fully replacing NEON.
AVX512 is all around a nice addition as JIT-based runtimes like .NET (8+) can use it for most common operations: text search, zeroing, copying, floating point conversion, more efficient forms of V256 idioms with AVX512VL (select-like patterns replaced with vpternlog).
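Outside of .NET, the vpternlog idiom mentioned here looks roughly like this with C++ intrinsics (a sketch, assuming AVX-512F+VL; 0xCA is the classic bitwise-select truth table):

```cpp
// Bitwise select: each result bit comes from if_set where mask is 1, else from
// if_clear. Pre-AVX-512 this is and/andnot/or (or a blend); vpternlog is one op.
#include <immintrin.h>

inline __m256i BitSelect(__m256i mask, __m256i if_set, __m256i if_clear) {
  // 0xCA encodes (A & B) | (~A & C) with A=mask, B=if_set, C=if_clear.
  return _mm256_ternarylogic_epi32(mask, if_set, if_clear, 0xCA);
}
```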
SVE2 is an extension on top of SVE which some stuff already implements. The issue is more likely to be the politics of moving to ARMv9 than anything else.
As to SVE though, I'd guess variable execution time makes the implementation require a bit of work. Normally, multi-cycle operations take a fixed number of cycles. Your scheduler knows that MUL takes N cycles and plans accordingly.
SVE seems like it should require N-M cycles depending on what is passed. That must be determined and scheduled around. This would affect the OoO parts of the core all the way from ordering through to the end of the pipeline.
That's definitely bordering on new uarch territory and if that is the case, it would take 4-5 years from start to finish to implement. This would explain why all the ARMv8 guys never got around to it. ARMv9 makes it mandatory, but that was released in 2021 or so which means non-ARM implementors probably have a ways to go.
SVE doesn't need variable-execution-time instructions, outside of perhaps masked load/store, but those are already non-constant. Everything else is just traditional instructions (given that, from the perspective of the hardware, it has a fixed vector size), with a blend.
My guess is that Apple is simply not interested in some of the ARMv9 features. They are not eager to implement SVE, and the secure virtualization features are probably not that relevant to them.
I do find it amusing that journalists never go beyond Twitter for discussion on this because this was all being confirmed on Mastodon days before any of the posts in the article
SME and streaming SVE. In fact I was going to include “…and nobody seems to have good evidence of ARMv9 support” but I figured my comment was enough as it was ;)
Thing is, normal people don’t really like interacting with the kind of person that would have jumped over to Mastodon. Zeal is insufferable most of the time.
Well, unfortunately, the kind of person who is an expert on SME is on Mastodon. So if you're writing an article on it, you should probably go to them instead of tech influencers who recycle content on Twitter.
I’m fairly certain that staying on (or joining) Twitter shows just about the same amount of “zeal” as leaving for Mastodon/BlueSky/Threads. Or put another way, I find people on Twitter to be insufferable.
It’s not a secret that a large portion of actually technical people (not tech influencers) left Twitter for Mastodon. So the people on Twitter may be more polished turds but are turds nonetheless.
Full disclosure: I still use my Twitter for customer support because that’s all it’s good for at this point IMHO. I also don’t regularly read mastodon but it’s where the people I care about are and when I post (rarely) that’s where I do it.
ARM really could've come up with better numbering / identification. I suppose it's ARM, emphasis on v, then 9, to differentiate it from ARM9, such as ARM9E-S?
This is more like supporting AVX512 than a whole separate architecture. If you have to target both old and new devices from one binary, you do a runtime feature check and call the corresponding code.
That is certainly more code, but not double. You only need it for the parts of the code that are both (a) bottlenecks worth optimizing and (b) actually benefit by using the new instructions.
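A sketch of that pattern on macOS (the exact sysctl key is an assumption on my part; Apple exposes a family of hw.optional.arm.FEAT_* keys for this kind of check, and the kernel names below are hypothetical):

```cpp
// Sketch: one binary, two code paths, chosen once at runtime.
#include <sys/sysctl.h>
#include <cstddef>
#include <cstdint>

static bool HasFeature(const char* name) {      // e.g. "hw.optional.arm.FEAT_SME" (assumed key)
  uint32_t value = 0;
  size_t len = sizeof(value);
  return sysctlbyname(name, &value, &len, nullptr, 0) == 0 && value != 0;
}

void MatVec(float* out, const float* m, const float* v, int n) {
  if (HasFeature("hw.optional.arm.FEAT_SME")) {
    // MatVecSME(out, m, v, n);    // hypothetical SME/streaming-SVE kernel
  } else {
    // MatVecNeon(out, m, v, n);   // hypothetical baseline NEON kernel, runs everywhere
  }
  (void)out; (void)m; (void)v; (void)n;  // placeholders until the kernels exist
}
```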
In games, perhaps. In basically nothing else though. And I have 0 games on my M1, yet my apps folder is 23 GB. Docker, Edge, and SketchUp all 2+ GB, despite not having almost any UI to speak of.
(Edit to remove iMovie from the list, as it has GB's of "Transitions" and "Titles" that I really should just delete)
All three examples you gave have substantial UI and other bundled assets. For example, the Docker Desktop app is about 2GB on my computer, yet included assets make up at least 1.2GB, and a further 600MB is a bundle containing the UI, which itself is about 100MB of binaries.
If you actually open those bundles (as they're called on macos) and take a look inside, you'll see that they don't even contain all of their assets, anyways, often linking to frameworks contained in ~/Library
This is a very layperson explanation, btw, but I assure you that "in modern desktop software, code is a tiny bit of the total size of the application" is a very true statement.
There are cases where the app icon is larger than the compiled code in some apps and if you include a couple images for things like “here is how you give my app access to record the screen” that can also account for a large part of your app bundle.
Yes, game assets really take it to another level but it’s been my experience that even apps without a lot of UI still have their images making up a lot of the app bundle size.
ARMv9 is just ARMv8.5 with 4 extra extensions. It's not a complete overhaul like the ARMv7 to ARMv8 change was.
It's more comparable to x86 chips with AVX-512 and chips without AVX-512. 99% of your code is the same, but the compiler will generate SSE, AVX, and AVX-512 variants and choose the correct one based on the CPU.
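That dispatch can even be compiler-generated; a rough sketch with GCC/Clang's target_clones attribute (the function here is just an illustration):

```cpp
// The compiler emits one clone per listed target plus a resolver that picks the
// best supported clone at load time, so one binary covers SSE-only through AVX-512 CPUs.
#include <cstddef>

__attribute__((target_clones("default", "avx2", "avx512f")))
void Scale(float* x, float s, size_t n) {
  for (size_t i = 0; i < n; ++i) x[i] *= s;   // auto-vectorized differently per clone
}
```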
Are there any extensions that ARMv9 is required to have? I'm looking through the reference manuals and those 4 extra extensions are all marked as "OPTIONAL" for ARMv9.
OpenStep defaulted to four architectures: hppa, sparc, i386 and m68k. I often built stuff on an HP for production use on Intel and 68000 boxes, and I think they also had unreleased m88k support at the same time, so internally they might have had five-way binaries.
I do also believe it wasn't ever technically "statically linked", but the dynamic libraries were separately distributed as part of every app (and so I'd think the semantic slippage acceptable given the context). This has tons of advantages, but prevented Swift from being used in Apple libraries.
Right; I was using loose and casual language there; I meant "bundling", not literally statically-linked.
-----
I said "statically-linked" because the first thing that came to mind was go-lang's morbidly obese executable binary sizes, and mentally walked to iOS app sizes.
Windows binaries often seem to have excess statically linked libraries.. even though they are called DLLs which is supposed to mean dynamic. They might be loading it dynamically but they still seem to have decided to include their own private copy.
I've even seen windows binaries have multiple different versions of the same DLL inside them, and it's a well known DLL that is duplicated multiple places elsewhere.
All OSes/Apps do this but maybe a lot of Mac apps do it a little less. (I don't even have any real statistical idea how common this is with windows apps either)
From having worked on Windows, OSX, and Linux desktop software over the years there's a few factors at play off the top of my head:
- Windows DLLs don't usually have strong versioning baked into the filename. On OSX or Linux, there's usually the full version number baked in (libfoo.so.3.32.0) with symlinks stripping off version components. (libfoo.so, libfoo.so.3, libfoo.so.3.32) would all be symlinks to libfoo.so.3.32.0 and you can link against whichever major/minor/patch version you depend on. If your Windows app depends on a specific version it's going to be opening DLLs and querying them to find out what they are.
- Native OSX software (not Electron) seems to depend much less on piles of external libraries because the OSX standard library is very rich and has a solid history of not breaking APIs and ABI across OS versions. While eg CoreAudio is guaranteed to be installed on an OSX install and be either compatible or discoverably-incompatible, the version of DirectSound you're going to have access to on Windows is more of a crapshoot.
- Windows apps (except for the .Net runtime sometimes) are often designed for longevity. A couple of months ago I installed some software that was released in 1999 on my Windows 11 machine and it just worked. Bundling up those DLLs is part of why they work.
- Linux apps can rely on downstream packaging to install the necessary shared libraries on demand, generally speaking. Linux desktop apps distributed as RPMs or DEBs can "just" declare which libraries they need and get them delivered during install.
On Windows isn't it possible to have the OS deal with the DLL version issue by using side-by-side assemblies? I believe in practice that's only ever used by DLLs provided by the OS, but I thought it was possible to apply the mechanism to other DLLs as well.
Maybe? I haven’t really done a deep dive into that. You’d still have to bundle them along with the installer though since there isn’t a good way to request 3rd-party DLLs (heck, there isn’t even a good way to request a specific version of MSVCRT…)
Since the move to Apple Silicon you are realistically never more than 12-18 months away from a new chip generation in a MacBook. An M1 is still plenty good for the vast majority of workloads, especially if it's an M1 Pro/Max/Ultra.
Actually probably the best thing to do is wait until the M4 machines launch then bag a good deal on a clearance M3.
That’s actually a nice side effect of all the rumors pages. The rumors of future products keep me from buying the current products. I keep on using my previous products while saving money and the planet, and being excited about what the future holds.
On the contrary, I think that the reliable update cadence in modern electronics means that people should generally all but ignore future product roadmaps.
When you actually need to get a new device, just get whatever the up-to-date thing is.
OK, ok, I suppose that it's reasonable to check the rumor sites to see if you should delay by a month or two. But not any longer than that.
It's much harder with PCs, where you can get, for instance, new ThinkPads with anything from 11th-gen Core i all the way to new Core Ultras. And, now, ARMs as well...
I was in line to buy a 128GB M3 Max; now that I know the M4 exists and has already shipped in the iPad, which tells me the whole M4 pipeline has already started and what the perf numbers are, I will absolutely be waiting. I survived yesterday without it, I can survive tomorrow. And now I can budget in the AMD Epyc bridge that covers that span.
I think Apple has been pretty good about hitting the right cadence with processor perf increases. They are making up for lost Intel time. The M6 is going to make us lose our minds. Apple is going to bring back "this is a munition" ads.
Both the M3 and the EPYC will be useful for far longer than the time it takes Apple to get the M4 into their next-gen laptops. Computers last a lot longer than they used to. I have a 10 year old Mac Mini that’s still comfortable to use, and, while an M3 Mac is a beast, it’s not so much faster than an M2 (or an i7) as to create a qualitative change in my workflows. What is possible now was already possible last year. It’s just faster now. I get a higher return on investment with better keyboards and screens.
> The rumors of future products keep me of buying the current products.
For myself, I like to think of it as applied procrastination. I could buy that new thing I want today.. but something better will come along in time, so I can afford to put it off a while longer yet..
> The rumors of future products keep me of buying the current products.
Spot on!
Back in the nineties, Intel managed to push competing RISC architectures (UltraSparc, MIPS, DEC Alpha, PowerPC) out of the market using nothing but promises that Itanium was going to blow them all out of the water.
And apparently Apple is okay with procrastinating and cannibalizing current sales of M1, 2, 3 if it helps prevent some Snapdragon (or Ampere) sales.
>And apparently Apple is okay with procrastinating and cannibalizing current sales of M1, 2, 3 if it helps prevent some Snapdragon (or Ampere) sales.
Sales of what?
I actually can't think of a single competing product. Admittedly I don't keep up with laptop news, but still, I haven't heard of anything yet that can meaningfully compete with the M1 from four years ago.
Microsoft just announced some lackluster arm laptops that they claim can compete with M-series chips. The question is what windows programs are gonna run on them...
Some people have been running Windows 11 for Arm on a VM in Apple Silicon. It has an automatic transcoder that translates most x86 code at start. It seems to run many apps well. Microsoft claims these new machines have a better transcoder. This might work.
For me at least, the best possible outcome of this is that Windows handheld gaming devices become more power-efficient. That might be an advantage over Linux-based handhelds for a while, unless Valve decide that Proton needs to also be an architecture emulator. The chip efficiency wins must surely be tempting in this form factor.
> The rumors of future products keep me of buying the current products.
You may have heard of the 5-minute rule - "Will doing this take me less than 5 minutes? If the answer is yes, do it now." An adaption of that to reduce impulse purchases is - "Do I really need this product right now? If the answer is no, don't buy it."
And on the flip side I am generally hesitant to buy first-release Apple hardware. Over the 20 years I've been buying Apple kit I've generally found it to be exceptionally robust but newly released hardware has had enough bugs (either hardware or OS) that I just sit back and let other users find the issues first. But I do simultaneously have the same issue: if WWDC is coming up within a month or two I'm not going to be buying any hardware because there's a good chance that something new will be released or the hardware I was going to buy is going to get a refresh or a price drop.
I do this technique too, and it's a great time for it. The OLED screen on the new iPad signals that Apple devices are moving to a better panel. If you've been waiting for the right time to move off an Intel Mac and onto a SoC Mac, it's now. Pick up a refurbished M2 MacBook. They're in the sweet spot for support, power, and cost.
The next one will probably have an OLED screen; so if you wait til then, your refurb M1/2/3 will be on Apple's short list of devices they don't want to support. (And you might have panel FOMO.) Or you'll have to pay the premium price for the latest model.
These machines are great. I still use my 2015 rMBP as a secondary. It's a little slow now but a couple years ago I was still running Solidworks (in Bootcamp) on it with minimal issues.
My wife is still using her 2012 MBP. We maxed out the RAM and gave it an SSD in 2016. She uses it for video editing and music production. The thing looks like new. Completely ridiculous. Only downside: no OSX updates since I don’t know when.
You might find OpenCore Legacy Patcher[1] worth a look. In many cases, it allows later-than-supported macOS versions to be installed on older Macs.
As a data point, I still use a 2013 Mac Pro as my primary desktop, and I've been using Sonoma on it for several months, have been able to install all Sonoma patches over-the-air on release without incident, and have only experienced a single, trivial problem: the right side of the menu bar occasionally appears shaded red, in a way that doesn't affect usability; switching applications immediately resolves the problem (the problem appears to be correlated with video playback).
Video encoder/decoder support and performance improved by an order of magnitude in the M series; I am surprised that didn't sway you.
Not just that: high-res stuff or modern codecs like AV1 or H.265 are probably not supported at all on a 2012 device that's gone without updates for so long?
Even if support were possible it would be software encoding, and even a short clip can take hours to render?
I would happily use an older device for a lot of dev work, especially if it's not frontend or UI; usually I can use any laptop just as a terminal. But for UI or video editing I wouldn't be able to.
I can't help but reply every time this thread comes up. I'd still probably be using my 2010 if it wasn't for a series of mechanical failures. Paid to replace the keyboard once (85 screws, didn't need to do that to myself), but the third battery crapping out, the trackpad not clicking (probably due to the swollen battery) and the MagSafe connector getting loose and glitchy was the end of it. Though I did just boot it up because my phone is somehow still supposed to sync music from it.
If the battery is swollen, get rid of it as soon as possible. Swollen battery == ticking time bomb, and I'm not joking about the bomb part. These things can, do and will explode randomly.
Overall, my 2020 M1 MBP is infinitely better than the 2015 MBP I had before, it's not even close. Battery life, thermal output, speed, noise, neural engine (for ML workloads). It's an utter workhorse that just marches on, no matter what I throw at it. I haven't even considered upgrading to another more current Mx version because this one just.. works. Best laptop I ever owned.
I just want to echo this experience and sentiment. I absolutely adore my 2020 13” M1 mbp, for all the reasons you list. I do ML workloads and Linux builds and I’m starting to think they forgot to put fans in mine because I’ve never heard them! Despite the annoying limitation of 1 external screen, it’s up there with my 2007 13” mb (rest in peace) as being the best laptop I’ve ever owned.
I recently upgraded from a 2019 Intel Mac to a similarly-specced M3 Mac, and it really is night and day. My battery life is more than doubled - I can run IntelliJ and multiple Docker containers on battery for more than my whole work day, when before it would barely last a couple hours with that load and be slow while doing so. The fan hardly ever runs while on my Intel Mac it would run constantly.
I can definitely say there's a downside. I sometimes take the bus home, but it can get chilly at night. Previously, I would fire up a little python script that saturates all the cores, to warm my lap. My old Intel was plenty warm to keep me from getting too uncomfortable. I can't even feel my M2 through my pants, and sticking it into my shirt makes me look like an idiot.
Battery life is insanely better. If you have not used one of the M series laptops it cannot be overstated how much better the battery life is. It is worth it for battery alone.
But beyond that they are also incredibly fast and run cool. In the MacBook Air there is no fan and on the Pros they barely ever spin up in an audible way.
The fans literally never come on for my personal M2 MBP 14" or on my work 16" M1 (it helps that the heavy lifting of running stuff and compiling happens on a dev server)
During work from home during Covid I was still using an Intel MBP and video conferences invariably caused the fans to kick up to the point where using noise cancelling headphones and not the built in speakers was necessary for sanity.
I went from the last Intel i9 16" MBP to an M3 Pro in the last month at work.
I think it's saving me an hour a day; the fan has never come on, the laptop has never felt warm, and the battery life is just mind-blowing.
I run docker & compilers all day. The i9 would run the fan 75% of the time and had to throttle down any time it was on battery power and it was lucky to last 3 hours on battery.
The way to exit that loop is to convince yourself that the next one will bring a truly lasting difference. Which is why I'm still waiting for GDDR7 GPUs with my 4GB RX 480.
It entirely depends how long you keep your devices. I try to keep my iPhones until release year + 6, so I would need the price of a previous version to be reduced by more than 1/6th on a new version release, which is usually not the case.
Similar to cars, most depreciation happens in the first year.
So owning a device for 6 years between age 1 and 7 will generally have a lower cost than owning a device between age 0 and 6.
For Apple products it’s generally feasible to effectively buy first hand devices aged 1+ because they’re still available for sale (at least in some retailers) after a new edition is released.
That’s a good strategy with most things that aren’t prone to manufacturing variability. For cars, launch versions have a lot of initial manufacturing defects that need to be worked out.
Maybe it uses the camera for gesture recognition so you can air-write each letter one at a time? Air-quotes will be fun... air-tabs, not so much. "Space, but <widens arms> BIGGER!"
If anything, they'll probably use the "studio" branding (or more likely just have it under the iPad line, since they have desktop chips in them now anyways)
I just bought a second hand M2 Air in perfect condition and it feels faster than my M1 Max in a really beautiful body for travel. I’m not certain it matters that much anymore to be honest. What are you using it for?
So if you can 'limp' along towards the autumn/winter/Christmas, then it's probably worth the wait to get the M4 (or pickup an M3 when the price presumably drops to clear inventory).
I just bought a refurbished 16in M3 pro, no regrets at all. There's always a new one around the corner, it's really just about whether your setup achieves what you need it to.
Look at real world differences between M2 and M3, it's not a massive jump at all.
I do cross platform app development and the machine is excellent for that. Glad to have it now rather than waiting months for a slightly better system
The most incredible thing about the new iPads is that even with the crazy fast M4 chip MS Teams manages to crawl to a halt. Clearly it takes all the engineering skills of the largest and most valuable software company in the world to make text entry go at about 1 fps on a chip as powerful as the M4.
Every character you type results in some sort of hit to their telemetry server. It will include the actual letter you typed, if you or your org are not configured to be in EU. With their EU configuration option (pulled from server every launch) it will only report the fact that you typed _something_.
Now if that's not fun enough, their telemetry also covers mouse movements. Go ahead and watch your CPU as you spin your mouse in circles around the Teams window.
For extra fun, block their telemetry server and watch Teams bloat in RAM, to as much as your system has, as it keeps every action you take in local memory as it waits for the ability to talk to that telemetry server again.
If you're going to block their telemetry it's best to fake an accept via some mitm proxy and send back a 200 code.
I do not know exactly how much this applies to iPad version, compared to their desktop apps. Mobile offers both more and less data possibilities. It's a different context.
So they’re running a keystroke logger and masquerading it as “telemetry”? That should be outlawed. It’s not a drafts feature, it’s not an online word processor, it’s just a straight up keystroke logger.
I’m in the EU and should have the EU configuration.
The problem mainly occurs when I mention someone in a reply to a thread. Once I type @<name> the text input just slows down so much I can type much faster than it can render the text.
It's still going through the same telemetry action, it just omits the actual character you typed. And yes, it is character by character. That collection (eg, the timestamp you hit the character, channel/person it was to, etc) is the inefficiency causing your typing to slow.
If it was a straight text box that they polled contents of _occasionally_ or after you hit 'send', it would be a much better user experience.
Haha what exactly does it protect me from? If they see that I typed something to someone, and see the chat history of me having sent a particular message at a particular time, it doesn't take a genius to put things together.
As much as I dislike overbearing telemetry, in the context of an M4 or even an N95 the computer should be more than capable of logging a few kb of telemetry about inputs per second without it even being noticeable in the performance statistics. The problem remains that every single thing in the app is implemented god-awfully slow, and it would still be regardless of whether it's also recording telemetry on the input data or not.
...do you have any source verifying this? Setting aside how insane of an issue it'd be network-wise/privacy-wise/etc, this is like a day one "debounce the call" fix.
I'm not even saying I doubt you, I'm just curious how you ascertained this exact behavior.
Mainly this was just myself getting irritated at MS Teams and trying to figure out what it was doing. It was a couple years ago and my current company doesn't use Teams, thankfully, so I can't really see if it's still valid.
From what I remember..
There are files on the disk that get updated/overwritten with pulls from the server every time it launches. Somewhere in AppData I think. A few of these are config files (with lots of interesting looking settings, including beta features).
One of the config entries specifies a telemetry endpoint (which you _could_ figure out with a network tracing tool, but there are a ton of MS telemetry endpoints your machine is probably talking to; best to just grab the one explicitly being used from the config like this). I forget the full name of the setting but the name pretty clearly indicates it's for telemetry, and the file is clearly a config file. If you can't find it just by browsing the structure, try a multi-file search tool and look for 'telemetry' or URL/hostnames.
You can't really change the value on disk and make it just take effect from there, since it gets downloaded from the server and overwritten before Teams loads. There might be some tricks you can do locally to persist the change but nothing seemed to work for me. You could override response from server via mitmproxy but that requires finding where it comes across the wire at launch time and then building a script/config to replace it.
Anyway, you can block that telemetry endpoint from a firewall and see your memory bloat. Or you can intercept that endpoint in any mitm proxy. I went with this [mitmproxy](https://mitmproxy.org/). From there you can capture the content it sends to the endpoint, or even change the response the server sends (Teams just seems to expect a 200 code back).
The telemetry data itself is some kind of streaming event format. I think I even found documentation on the structure on some Microsoft website, so it's likely a reused format.
It's pretty straightforward.
I couldn't spend too much time on it and now it's not something I even use, but some cool things you might want to try if you dive deeper into this:
- Overwrite the config file as it returns from the server, to turn on EU data protection, change various functionality you're not supposed to, or flip some feature flags.
- Figure out if there's a feature flag or even other overwrite to fully disable the metrics so they aren't even collected, from anywhere in the app.
- Intercept telemetry, return an 'OK' response and drop the data from telemetry, or maybe document what they collect more definitively if you think there's interest somewhere. This keeps your privacy but doesn't really do anything for performance.
- Interfere with the data before actually returning it, maybe try playing with event contents and channel/user indicators. Microsoft probably won't like this if they notice, but it's unlikely they'll even notice.
No clue why they're even doing this and not just sampling after the fact. There's no way they are gleaning anything useful that they couldn't more efficiently (and anonymously) capture.
Ah, that's right, Chrom* based things look at the system store by default and it's the Firefox based things that don't (without configuration at least). Thanks.
Edit: and that reminds me I should probably run this test on new Teams, where it now uses the built in WebView2
Is the telemetry really that big of a load? Having a persistent connection and sending data over it is pretty standard for games going back decades, even AOL instant messenger had this feature for typing.
Asana used to sometimes have textareas that would take a full three seconds to display each key press. On then-current MacBook pros. You know, something that had nearly zero latency on first-gen single-core Pentium chips. Hell it may still do that, I never saw them fix it, I just finally got to stop using it.
Never underestimate the ability of shitware vendors to make supercomputers feel slower than an 8086. These days it usually involves JavaScript, HTML, and CSS.
MS Teams is by far the worst piece of software I’ve ever used. It is ungodly slow and it just gets slower the more you use it.
I believe it is actually hitting the server to update the online/away status light for every single message in a conversation. If you turn off all the status update stuff in the settings then the software speeds up dramatically. Another thing you can do is find the folder where it caches everything and just trash the entire thing. Somehow, they’ve managed to make caching slow everything down rather than provide a speed up.
I use it exclusively as a progressive web app on Linux and it's not particularly slow, but it is buggy as all hell. Easily the worst rich text editing experience of any chat client I've ever used. Teams is without a doubt the worst piece of software I'm required to use on a daily basis
Why is that not at least done asynchronously? I thought part of the whole narrative of shipping these new terrible pieces of software as standalone Google Chrome instances was that it makes it easier to spawn async JS workers for background tasks and whatnot?
Async is still difficult. There is no getting around data synchronization issues: either you spend a lot of time on design or you will get constant problems with things like not holding a mutex when you should, mutex deadlocks, holding a mutex too long, or locking/unlocking too often.
I haven't done async JS, but I've done enough async elsewhere to know that language cannot work around bad design.
You still need to design a CRDT that solves your particular problem, you don't just say the magic word "CRDT" and the problem is gone. And the performance will depend on how good the design is.
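As a toy illustration of "the design drives the result" (a grow-only counter, about the simplest CRDT there is; anything like collaborative text or presence state needs far more careful design):

```cpp
// Sketch of a G-counter CRDT: merge is commutative, associative and idempotent,
// so replicas can sync in any order. Real chat/presence state is much harder.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct GCounter {
  std::vector<uint64_t> counts;                 // one slot per replica
  explicit GCounter(size_t replicas) : counts(replicas, 0) {}
  void Increment(size_t replica) { ++counts[replica]; }
  uint64_t Value() const {
    uint64_t sum = 0;
    for (uint64_t c : counts) sum += c;
    return sum;
  }
  void Merge(const GCounter& other) {           // element-wise max
    for (size_t i = 0; i < counts.size(); ++i)
      counts[i] = std::max(counts[i], other.counts[i]);
  }
};
```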
I think that’s what it’s doing. If you have a conversation with a person spanning hundreds of messages (over many weeks) it’ll be updating the status light next to their name on every single message in the history. The more messages in the history, the more workers you get!
LOL, you wanna try it on my ~2 generations old Core i5 corporate laptop. Sometimes, the first steps of drawing the calendar view are roughly the same speed as me drawing it in Paint.
Maybe someone should normalise giving developers crappy laptops to develop on.
(Has anyone done a deep dive into Teams to explain what on earth is going on? I mean, if VSCode can be fast despite its underlying architecture, surely something could be done about Teams?)
>> Maybe someone should normalise giving developers crappy laptops to develop on.
Then the developers will complain the hardware is unusable to do their job, even though it would have been a supercomputer back in the day. Then you say "No, it's the software, please fix it."
Yup, my work machine has an older i7 (2021-era maybe?) with 32GB RAM and between Teams, Slack, "new" Outlook, Jira, WSL, driving a 4K display off the piddly integrated GPU, and a VPN that involves every packet doing a transatlantic roundtrip whenever I want to connect to an internal service, everything is just dog slow. And the fan noise—my god the fan noise.
Some days it makes me extra motivated to make the code I write fast and efficient; other days I want to give up entirely.
Do companies using Teams have a choice of using something else or are their C*Os and IT departments married with Microsoft? If the latter, they'll use whatever Microsoft throws at them, even if it doesn't work.
Weird, a statically built ELF that supports TLS1.3 + HTTP1.1 is like ~30kb, all you need is Emacs as a UI and you have Teams 2 at 1.0e-5% resource usage.
Try Pidgin with the excellent ms teams plugin: https://github.com/EionRobb/purple-teams - less than 100mb ram usage and notifications that still work after an hour. Only for (video) calls you need to open teams..
Teams on the desktop has improved with their "v2" client. It's not the world's fastest piece of software, but I find it to not be embarrassingly slow now (on a reasonably specced machine).
One has to hope that the same performance lens will now be turned on the mobile apps.
It seems that the M4 was overhyped. Almost all of the performance improvements, in Geekbench for example, come from new instructions that most apps won't use, and even if they do they might end up using the faster GPU/NPU for those tasks.
No, per-clock performance improvements between M3 and M4 range from 0% to 20%, this is ignoring the two subtests that benefit from SME. That Twitter post is moot. GB results show high variation, it is easy enough to cherry pick pairs of results that show any point you might want. You have to compare result distributions. There were some users on anandtech forums who did it and the results are very clear.
Makes one wonder whether the Apple miracle has mostly been, first, the transition to ARM and having access to TSMC's highest-end nodes before anything else even comes into the picture. But I'm glad new competition is coming from the Qualcomm X Elite and Huawei with their Kirin and Ascend chips. Hopefully the ARMs race will be more interesting to follow than the x64 race between Intel and AMD.
Oryon was designed to compete with M1 then the clockspeeds were ramped up to compete with M2. M3 clearly beat it out and M4 has only furthered that lead.
Oryon will still probably beat x86 designs massively in performance per watt which is pretty much the most important metric for most people anyway (as most people use laptops).
EDIT: your username `dragonelite` is quite interesting. You joined 2019, but the coincidence is fascinating.
Bitcode did not allow recompilation to take advantage of new instructions. They dropped bitcode because they never actually managed to do anything with it other than the armv7k to arm64_32 recompilation, and that required specifically designing arm64_32 around what was possible with bitcode.
Updating apps to use new vector instructions is far more complicated than upgrading to a new compiler version and having it magically get faster.
SME is very specialized, right now no compiler (that I know of) is really able to take general-purpose code and output optimized SME. So for these instructions at least, bitcode wouldn’t be of any benefit.
It’s a decent, but not revolutionary improvement. Yes, most of the gains outside of SME are coming from clock increases not IPC. I don’t know if I would call it overhyped, more like misunderstood.