This is a fantastic use of the D-Wave technology and really highlights the ability to use annealing for a wide range of use cases. I'm definitely going to follow this development closely!
It will be interesting to see how this works on the Advantage2, especially when it moves past the prototype stage. The greater connectivity between qubits may allow for an even better demonstration.
I've been seeing Mac folklore posted to HN recently - this is a fantastically well produced podcast that goes through folklore.org and other articles from the time period, with asides, tidbits, interviews etc. that help paint a picture of this fantastic era in computing history. Well worth a listen.
Well, if your system elastically uses GPU compute and needs to be able to spin up, run compute on a GPU, and spin down in a predictable amount of time to provide reasonable UX, launch time would definitely be a factor in terms of customer-perceived reliability.
All the clouds are pretty upfront about availability being non-guaranteed if you don't reserve it. I wouldn't call it a reliability issue if your non-guaranteed capacity takes some tens of seconds to provision. I mean, it might be your reliability issue, because you chose not to reserve capacity, but it's not really unreliability of the cloud — they're providing exactly what they advertise.
"Guaranteed" has different tiers of meaning - both theoretical and practical.
In many cases, "guaranteed" just means "we'll give you a refund if we fuck up". SLAs are very much like this.
IN PRACTICE, unless you're launching tens of thousands of instances of an obscure image type, reasonable customers can get capacity from the cloud, and get it promptly.
That's the entire cloud value proposition.
So no, you can't just hand-wave past these GCP results and say "Well, they never said these were guaranteed".
Ignoring the fact that the results are probably partially flawed due to methodology (see top-level comment from someone who works on GCE) and are not reproducible due to missing information, pointing out the lack of a guarantee is not hand-waving. The OP uses the word "reliability" to catch attention, which certainly worked, but this has nothing to do with reliability.
This isn't actually true, even for tiny customers. In a personal project, I used a single host of a single instance type several times per day and had to code up a fallback.
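For the curious, the fallback was roughly this shape (a minimal sketch in boto3 terms; the instance types and the error handling here are illustrative placeholders, not the exact code from my project):

    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Ordered by preference; fall through when the preferred type has no capacity.
    # (Placeholder instance types, not the ones from my project.)
    FALLBACK_TYPES = ["g5.xlarge", "g4dn.xlarge"]

    def launch_with_fallback(ami_id):
        for instance_type in FALLBACK_TYPES:
            try:
                resp = ec2.run_instances(
                    ImageId=ami_id,
                    InstanceType=instance_type,
                    MinCount=1,
                    MaxCount=1,
                )
                return resp["Instances"][0]["InstanceId"]
            except ClientError as err:
                # Only fall back on capacity errors; re-raise everything else.
                if err.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise
        raise RuntimeError("no capacity for any configured instance type")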
Try spinning up 32+ core instances with local SSDs attached, or anything not in the n1 family, and you will find that in many regions you can only have like single digits of them.
I'd still consider it a "performance issue", not a "reliability issue". There is no service unavailability here. It just takes your system a minute longer until the target GPU capacity is available. Until then it runs on fewer GPU resources, which makes it slower. Hence performance.
The errors might be considered a reliability issue, but then again, errors are a very common thing in large distributed systems, and any orchestrator/autoscaler would just re-try the instance creation and succeed. Again, a performance impact (since it takes longer until your target capacity is reached) but reliability? not really
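To make that concrete: the retry any orchestrator/autoscaler already does is roughly this shape (a sketch only; create_instance and the error type are placeholders for whatever your cloud SDK actually gives you):

    import time

    class TransientCapacityError(Exception):
        """Stand-in for whatever capacity/quota error your cloud SDK raises."""

    def ensure_capacity(create_instance, target, backoff_s=5.0):
        # Retry until the target instance count is reached. Transient creation
        # errors show up as extra time-to-full-capacity (performance), not as
        # an outage visible to callers (reliability).
        instances = []
        while len(instances) < target:
            try:
                instances.append(create_instance())
            except TransientCapacityError:
                time.sleep(backoff_s)  # back off, then try again
        return instances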
I’d like to see a breakdown of the cost differences. If the costs are nearly equal, why would I not choose the one that has a faster startup time and fewer errors?
With GCP you can right-size the CPU and memory of the VM the GPU is attached to, unlike the fixed GPU AWS instances, so there is the potential for cost savings there.
It is not reliably running the machine but reliably getting the machine.
Like the article said, the promise of the cloud is that you can easily get machines when you need them. A cloud that sometimes does not get you that machine (or does not get it to you in time) is a less reliable cloud than the one that does.
It’s still performance. If this were “AWS failed to deliver the new machines and GCP delivered”, sure, reliability. But this isn’t that.
The race car that finishes first is not “more reliable” than the one in 10th. They are equally as reliable, having both finished the race. The first place car is simply faster at the task.
You cannot infer that based on the results of the race...that's literally the entire point I am making. The 1st place car might blow up in the next race, the 10th place car might finish 10th place for the next 100 races.
If the article were measuring HTTP response times and found that AWS's average response time was 50ms and GCP's was 200ms, and both returned 200s for every single request in the test, would you say AWS is more reliable than GCP based on that? Of course not, it's asinine.
If you want that promise you can reserve capacity in various ways. Google has reservations. Folks use this for DR, your org can get a pool of shared ones going if you are going to have various teams leaning on GPU etc.
The promise of the cloud is that you can flexibly spin up machines if available, and easily spin down, no long term contracts or CapEx etc. They are all pretty clear that there are capacity limits under the hood (and your account likely has various limits on it as a result).
unfortunately cloud computing and marketing have conflated reliability, availability, and fault tolerance, so it's hard to give you a definition everyone would agree to. but in general I'd say reliability refers to your ability to use the system without errors or significant decreases in throughput that would make it unusable for the stated purpose.
in other words, reliability is that it does what you expect it to. GCP does not have any particular guarantees around being able to spin up VMs fast, so its inability to do so wouldn't make it unreliable. it would be like me saying that you're unreliable for not doing something when you never said you were going to.
if this were comparing Lambda vs Cloud Functions, who both have stated SLAs around cold start times, and there were significant discrepancies, sure.
true, the grammar and semantics work out, but since reliability needs a target, it's usually a serious design flaw to rely on something that has never demonstrably worked the way your reliability target assumes.
so that's why in engineering it's not really used as such. (as far as I understand at least.)
Why would you scale to zero in high perf compute? Wouldn't it be wise to have a buffer of instances ready to pick up workloads instantly? I get that it shouldn't be necessary with a reliable and performant backend, and that the cost of having some instances waiting for a job can be substantial depending on how you do it, but I wonder if the cost difference between AWS and GCP would make up for that and you can get an equivalent amount of performance for an equivalent price? I'm not sure. I'd like to know though.
> Why would you scale to zero in high perf compute?
Midnight - 6am is six hours. The on demand price for a G5 is $1/hr. That's over $2K/yr, or "an extra week of skiing paid for by your B2B side project that almost never has customers from ~9pm west coast to ~6am east coast". And I'm not even counting weekends.
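Back-of-the-envelope, using the $1/hr figure above and ignoring weekends:

    hours_per_night = 6        # midnight to 6am
    rate_usd_per_hour = 1.00   # on-demand G5, figure quoted above
    days_per_year = 365

    print(hours_per_night * rate_usd_per_hour * days_per_year)  # 2190.0, i.e. "over $2K/yr"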
But that's sort of a silly edge case (albeit probably a real one for lots of folks commenting here). The real savings are in predictable startup times for bursty work loads. Fast and low variance startup times unlock a huge amount of savings. Without both speed and predictability, you have to plan to fail and over-allocate. Which can get really expensive fast.
Another way to think about this is that zero isn't special. It's just a special case of the more general scenario where customer demand exceeds current allocation. The larger your customer base, and the burstier your demand, the more instances you need sitting on ice to meet customers' UX requirements. This is particularly true when you're growing fast and most of your customers are new; you really want a good customer experience every single time.
Scaling to zero means zero cost when there is zero work. If you have a buffer pool, how long do you keep it populated when you have no work?
Maintaining a buffer pool is hard. You need to maintain state, have a prediction function, track usage through time, etc. Just spinning up new nodes for new work is substantially easier.
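Even a toy warm-pool keeper ends up needing most of those pieces. A rough sketch (all names and the "prediction" rule here are made up for illustration, with create_instance/terminate_instance standing in for your cloud SDK calls):

    class WarmPool:
        # Toy warm-pool keeper: holds state, applies a (crude) demand
        # prediction, and reconciles the idle buffer toward it.

        def __init__(self, create_instance, terminate_instance, min_idle=1):
            self.create = create_instance        # your cloud SDK call goes here
            self.terminate = terminate_instance
            self.min_idle = min_idle             # the "prediction function", crudely
            self.idle = []                       # state you now have to track
            self.busy = []

        def checkout(self):
            # Hand out a warm instance if one exists, otherwise pay the cold start.
            inst = self.idle.pop() if self.idle else self.create()
            self.busy.append(inst)
            return inst

        def checkin(self, inst):
            self.busy.remove(inst)
            self.idle.append(inst)

        def reconcile(self):
            # Run periodically: grow/shrink the idle buffer toward min_idle.
            while len(self.idle) < self.min_idle:
                self.idle.append(self.create())
            while len(self.idle) > self.min_idle:
                self.terminate(self.idle.pop())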
And the author said he could spin up new nodes in 15 seconds, which is pretty quick.
> Over the past five years, there has been undeniable hype around quantum computing—hype around approaches, timelines, applications, and more. As far back as 2017, vendors were claiming the commercialization of the technology was just a couple of years away—like the announcement of a 5,000-qubit system by 2020 (which didn’t happen).
Inaccurate. D-Wave did in fact launch a 5000+ qubit system named Advantage in 2020.
> I don't mind that idea - if it turned out installing heated seats failed in say... 28% of installations, fine - charge me more for a model with functioning heated seats.
I know you just chose this as an example, but clearly it wouldn't be acceptable to anyone to have a failed installation of something like this. On one extreme, it could cause a fire; but otherwise, it's just a button that doesn't work, and damages the quality reputation of the car.
Processor binning only works because it is a black box, indistinguishable from the outside. You can purposely disable components if your yields get high enough, etc. and nobody will ever notice the difference.
For vehicles, automakers will learn from customers that this kind of nickel-and-diming is not appreciated, when people turn to other manufacturers that still "get it", e.g. Mazda.
Same here. It's barebones, but it's also cheap and plentiful on the used market, has enough functionality to be useful without being overwhelming, and fits easily in a backpack. Did some noodling around with it connected to my iPhone on a plane recently, no drivers needed if you get that Lightning-to-USB adapter, phone powered it fine, and I was able to drive Korg Gadget and make some shitty techno.
So go to the website and use it. Many of us simply see it as a way for more propaganda to be fed to us unasked-for when we open up a new tab (at least, until we remember to turn it off).
as with many things though it should be consent based and opt in; they can have a big button "add pocket to your firefox" and those who want it can opt in and everyone else won't have to go through the rigamarole of finding all the settings (strewn about in various locations including about:config) to disable it
Some of us would love to work on a team like this. It would be nice to have the option. Your definition of "acceptable" might not actually result in teams that can take on the big challenges we face as a species as men who did find this kind of thing acceptable retire out of the workforce.
If "We've always done it this way and it's a risk to do it differently" was the argument that carried the day, few of us would have to worry about these questions at all because we'd never have gotten out from under feudalism.