The cost to train an AI system is improving at 50x the pace of Moore’s Law (ark-invest.com)
254 points by kayza on July 5, 2020 | 54 comments


ResNet-50 with DawnBench settings is a very poor choice for illustrating this trend. The main technique driving this reduction in cost-to-train has been finding arcane, fast training schedules. This sounds good until you realize it's a kind of sleight of hand: finding that schedule takes tens of thousands of dollars (usually more) that isn't counted in the cost-to-train, but is a real-world cost you would experience if you want to train models.
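
To make the sleight of hand concrete, here's a toy calculation (every number is made up for illustration):

    # The headline "cost to train" only counts the final tuned run,
    # not the schedule search that produced it.
    headline_cost_per_run = 10        # USD, final DawnBench-style run
    search_runs = 500                 # schedule/hyperparameter trials
    cost_per_search_run = 40          # USD, un-tuned runs are slower
    search_cost = search_runs * cost_per_search_run   # 20,000 USD

    # Unless the found schedule is reused many times, the amortized cost
    # is dominated by the search, not by the headline run.
    for reuses in (1, 10, 100):
        amortized = headline_cost_per_run + search_cost / reuses
        print(f"{reuses:>3} reuses -> ~${amortized:,.0f} per training run")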

However, I think the overall trend this article talks about is accurate. There has been an increased focus on cost-to-train and you can see that with models like EfficientNet where NAS is used to optimize both accuracy and model size jointly.
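
For what it's worth, the kind of multi-objective reward such a search optimizes looks roughly like the sketch below (in the spirit of the MnasNet/EfficientNet papers; the w = -0.07 exponent and the FLOPS target are values I recall from the papers, so treat them as assumptions):

    def nas_reward(accuracy, flops, target_flops, w=-0.07):
        # Trade accuracy against model cost with a soft penalty exponent.
        return accuracy * (flops / target_flops) ** w

    # A 2x-heavier candidate with only a tiny accuracy gain scores worse:
    print(nas_reward(0.770, 400e6, 400e6))   # ~0.770
    print(nas_reward(0.775, 800e6, 400e6))   # ~0.738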


I would guess that this means DawnBench is basically working. You'll get some "overfit" training schedule optimizations, but hopefully amongst those you'll end up with some improvements you can take to other models.

We also seem to be moving more towards a world where big problem-specific models are shared (BERT, GPT), so that the base time to train doesn't matter much unless you're doing model architecture research. For most end-use cases in language and perception, you'll end up picking up a 99%-trained model, and fine tuning on your particular version of the problem.
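
For reference, that fine-tuning workflow is basically the following (a minimal sketch using the Hugging Face transformers library; the two-label setup and the toy data are assumptions for illustration):

    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)   # pretrained body, fresh 2-class head

    texts = ["great product", "terrible support"]   # your task-specific data
    labels = torch.tensor([1, 0])
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for _ in range(3):                    # a few passes over your data
        out = model(**batch, labels=labels)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()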


This is an odd framing.

Training has become much more accessible, due to a variety of things (ASICs, offerings from public clouds, innovations on the data science side). Comparing it to Moore's Law doesn't make any sense to me, though.

Moore's Law is an observation on the pace of increase of a tightly scoped thing, the number of transistors.

The cost of training a model is not a single "thing," it's a cumulative effect of many things, including things as fluid as cloud pricing.

Completely possible that I'm missing something obvious, though.


> Comparing it to Moore's Law doesn't make any sense to me, though.

I assume it's meant as a qualitative comparison rather than a meaningful quantitative one. Sort of a (sub-)cultural touchstone to illustrate a point about which phase of development we're in.

With CPUs, during the phase of consistent year after year exponential growth, there were ripple effects on software. For example, for a while it was cost-prohibitive to run HTTPS for everything, then CPUs got faster and it wasn't anymore. So during that phase, you expected all kinds of things to keep changing.

If deep learning is in a similar phase, then whatever the numbers are, we can expect other things to keep changing as a result.


> then CPUs got faster and it wasn't anymore

The enabling tech was the AES-NI instruction set, not raw speed.

Agree on the rest. The main reason why modern CPUs and GPUs all have 16-bit floats is probably the deep learning trend.


If it hadn't been AES-NI, it would have been ChaCha, which is much faster than unaccelerated AES and close to the speed of accelerated AES.

Phones use HTTPS without a problem, and those didn't have hardware-accelerated AES until recently.


A phone needing to set up a dozen HTTPS sockets is nothing for the CPU to do, even without acceleration. A server needing to consistently set up hundreds of HTTPS sockets is where AES-NI and other accelerated crypto instructions become useful.


Like many things, Moore’s law is garbled when adopted by analogy outside its domain.

What does “more transistors” mean? To you, it means just what Gordon Moore meant when he said it: an opportunity for more function in the same space/cost.

Laypersons and marketers grabbed the term and took it to imply “faster,” which was then absurdly conflated with CPU clock speed (itself an important input, though hardly the only one, in determining the actual speed of a system).

The use here is of the “garbled analogy” sort, which is surely the dominant use today.


Yes but that aspect of Moore's law for CPUs expired over a decade ago. It's the whole reason we got multicore in the first place.


Even with multi-core, a CPU today is only 6x faster than a 10-year old CPU.


The difference might be even less. 4 Sandy Bridge cores (excluding memory controller and graphics) were not much bigger than the current 8 core Zen 2 die.

Certainly the peak performance you can put in a socket is much higher, but it's got more silicon in it than it used to.


Agreed, but Moore's Law has morphed to refer to both xtors and performance despite his original phrasing.

The biggest innovation I've seen is in the cloud: backplane I/O and memory are essential, and up until a few years ago there weren't many cloud configurations suitable for massive amounts of I/O.


Ok, but achieving Moore's law has required combining an enormous number of conceptually distinct technical insights. Both training costs and transistor density seem like well-defined single parameters that incorporate many small complicated effects.


Implicit in Moore's law is that cost does not increase in the same way; otherwise the prices of chips would also be doubling. A more analogous measure would be the decrease in cost per transistor.

The number of transistors is also not dependent on a single thing; it can be argued that many macro events have contributed since the '80s: the VC model for chipmakers in SV, the rise of the internet, going fabless, the rise of mobile, and innovations in fabrication technology.


What are some domains that a solo developer could build something commercially compelling to capture some of this $37 trillion? Are there any workflows or tools or efficiencies that could be easily realized as a commercial offering that would not require massive man hours to implement?


Take any domain that requires classification work that has not yet been targeted and make a run for it. You likely will be able to adapt one of the existing nets or even use transfer learning to outperform a human. That's the low hanging fruit.

For instance: quality control: abnormality detection (for instance: in medicine), agriculture (lots of movement there right now), parts inspection, assembly inspection, sorting and so on. There are more applications for this stuff than you might think at first glance; essentially, if a toddler can do it and it is a job right now, that's a good target.
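
To give a sense of how low the bar is, adapting an existing net comes down to something like this (a minimal transfer-learning sketch using torchvision's pretrained ResNet-50; the two-class ok/defective setup and the image folder path are assumptions):

    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    model = models.resnet50(pretrained=True)
    for p in model.parameters():              # freeze the pretrained backbone
        p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)   # new ok / defective head

    tfm = transforms.Compose([
        transforms.Resize(256), transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    data = datasets.ImageFolder("inspection_images/", transform=tfm)
    loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

    opt = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for images, labels in loader:             # only the new head is trained
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
        opt.zero_grad()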


> abnormality detection (for instance: in medicine), agriculture (lots of movement there right now), parts inspection, assembly inspection, sorting and so on

None of these is something someone can run from their bedroom, because they have very high quality and regulatory requirements and need constant work outside of the actual AI training.

This is actually reflected in the margins of "AI" companies, which are significantly lower than those of traditional SaaS businesses and require significantly more manpower to deal with long-tail problems - which is where the AI fails, but which is what actually matters.


Well, depending on the size of your bedroom ;) I've seen teams of two people running fairly impressive ML based stuff. They were good enough at it that they didn't remain at two people for very long but that was more than enough to be useful to others. One interesting company - that I'm free to talk about - did a nice one on e-commerce sites to help with risk management: spot fraudulent orders before they ship.

In the long term, and to stay competitive you will always have to get out of bed and go to work. But the initial push can easily be just a very low number of people engaging an otherwise dormant niche.

Yes, medicine has regulatory requirements. But as long as you advise rather than diagnose the regulatory requirements drop to almost nil.


anything that's even remotely profitable is already taken


This simply isn't true. Every year since the present-day ML wave started has seen more and more domains tackled. Even something like that silly Lego sorting machine I built could be the basis of a whole company pursuing sorting technology, if you set your mind to it. And that's just ResNet-50 in disguise; likely you could do better today without any effort.

Your statement reminds me of 'all the good domains are taken', which I've been hearing since 1996 or so. Of course you'll need to do some work to identify a niche that doesn't have a major player in it yet. But the 'boring' niches are where a lot of money is to be made; the sexy stuff (cancer, fruit sorting) is well covered. More obscure things are still wide open - I get decks with some regularity about new players in very interesting spaces using thinly wrapped ML to do very profitable things.


Ah yes, of course. There will never be a new profitable ML startup until the end of time. Makes perfect sense.


People said the same thing about SaaS 5 years ago


You can give this article by Chip Huyen a read. Mayhaps you will find a niche for a solo or small dev team. Though it is focused on MLOps, if that makes a difference for the type of niche you're looking for.

https://huyenchip.com/2020/06/22/mlops.html


Extracting and selling data stuck in the mountain ranges of pdfs and other useless formats in every large corp, org, govt dept on the planet.

Do it for a couple of publicly available docs and then contact the org saying you offer 'archive digitization' so their data people can mine it for intelligence.

Most of the time and resources of 'Digital Transformation'/Data Science departments go to just manually extracting info from all kinds of old docs, PDFs and spreadsheets containing institutional knowledge.
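
A proof of concept for the 'couple of publicly available docs' step can be tiny (a sketch using pdfminer.six; the directory names and JSONL output are assumptions, and scanned documents would need OCR on top of this):

    import json
    from pathlib import Path
    from pdfminer.high_level import extract_text

    records = []
    for pdf in Path("archive/").glob("*.pdf"):
        try:
            records.append({"file": pdf.name, "text": extract_text(str(pdf))})
        except Exception as e:                # image-only PDFs need OCR instead
            records.append({"file": pdf.name, "error": str(e)})

    Path("extracted.jsonl").write_text(
        "\n".join(json.dumps(r) for r in records), encoding="utf-8")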


You need to be creative. But one example - colorizing old photos: https://twitter.com/citnaj


The cost of training is decreasing, but the meaningfully large and non-trivial training sets are almost exclusively in the domain of large companies, economically inaccessible to individual developers/startups.


This is a space I worked on during the crypto boom of 2017/2018.

There is an opportunity for a decentralized network that allows models to be trained against training sets that stay at the facilities holding them.

Think of all the data sitting in silos from clinical trials. There is of course the painful process of authenticating researchers for access to data like that but it can be done. There just needs to be an economic reason to make that kind of effort.

I got pulled in the direction of using ML to predict costs of care in insurance, so I didn't go further down the rabbit hole, but I did author a patent for a novel approach to having decentralized identities exchange data.

If any of this sounds exciting to you feel free to email me. hn (at) strapr (dot) com


krisp.ai, but using the GPU (also on Mac) and with a desktop version for Ubuntu Linux.


Ark Invest are the creators of the ARKK [1] and ARKW ETFs that have become retail darlings, mainly because they're heavily invested in TSLA.

They pride themselves on this type of fundamental, bottom up analysis on the market.

It's fine. I don't know if I agree with comparing Moore's law, which is fundamentally about hardware, to the cost to train a "system," which is a combination of customized hardware and new software techniques.

[1] https://pages.etflogic.io/?ticker=ARKK


I remember this article from 2018: https://medium.com/the-mission/why-building-your-own-deep-le...

Hackernews discussion for the article: https://news.ycombinator.com/item?id=18063893

It really is interesting how this is changing the dynamics of neural network training. Now it is affordable to train a useful network on the cloud, whereas 2 years ago that would be reserved for companies with either bigger investments or an already consolidated product.


> Now it is affordable to train a useful network on the cloud

I honestly don't see how anything changed significantly in the past 2 years. Benchmarks indicate that a V100 is barely 2x the performance of an RTX 2080 Ti [1] and a V100 is:

• $2.50/h at Google [2]

• $13.46/h (4xV100) at Microsoft Azure [3]

• $12.24/h (4xV100) at AWS [4]

• ~$2.80/h (2xV100, 1 month) at LeaderGPU [5]

• ~$3.38/h (4xV100, 1 month) at Exoscale [6]

Other smaller cloud providers are in a similar price range to [5] and [6] (read: GCE, Azure and AWS are way overpriced...).

Using the 2x figure from [1] and adjusting the price for the build to a 2080 Ti and an AMD R9 3950X instead of the TR results in similar figures to the article you provided.
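
Put differently, here's the back-of-the-envelope version (the cloud rate is the cheapest single-V100 price quoted above; the ~$2,500 local build price is an assumption):

    v100_per_hour = 2.50      # USD, cheapest single-V100 rate listed above
    local_build = 2500        # USD, assumed 2080 Ti + R9 3950X workstation
    speedup = 2.0             # V100 ~= 2x a 2080 Ti per [1]

    # Hours of local 2080 Ti training after which owning beats renting the
    # equivalent V100 time (ignoring power, resale value and utilization).
    break_even_hours = local_build / (v100_per_hour / speedup)
    print(f"break-even after ~{break_even_hours:.0f} hours of local training")
    # -> ~2000 hours, i.e. a few months of continuous training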

Please point me to any resources that show how the content of the article doesn't apply anymore, 2 years later. I'd be very interested to learn what actually changed (if anything).

NVIDIA's new A100 platform might be a game changer, but it's not yet available in public cloud offerings.

[1] https://lambdalabs.com/blog/best-gpu-tensorflow-2080-ti-vs-v...

[2] https://cloud.google.com/compute/gpus-pricing

[3] https://azure.microsoft.com/en-us/pricing/details/virtual-ma...

[4] https://aws.amazon.com/ec2/pricing/on-demand/

[5] https://www.leadergpu.com/#chose-best

[6] https://www.exoscale.com/gpu/


You are missing TPU and spot/preemptible pricing, which need to be considered when we are talking about training cost. The big one to me is the ability to consistently train on V100s with spot pricing, which was not possible a couple of years ago (there wasn't enough spare capacity). Also, the improvement in cloud bandwidth for DL-type instances has helped distributed training a lot.
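
As a rough sketch of why spot capacity matters for the bill (all numbers here are illustrative assumptions, not quotes):

    on_demand_v100 = 2.50             # USD/h, on-demand single V100 (assumed)
    spot_discount = 0.70              # spot is often ~70% cheaper (assumed)
    preemptions_per_hour = 0.05       # assumed preemption rate
    lost_hours_per_preemption = 0.25  # work redone, with frequent checkpoints

    spot_rate = on_demand_v100 * (1 - spot_discount)
    # Effective rate once you pay for re-doing work lost to preemptions.
    effective = spot_rate * (1 + preemptions_per_hour * lost_hours_per_preemption)
    print(f"on-demand ${on_demand_v100:.2f}/h vs. effective spot ${effective:.2f}/h")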


Nothing really has changed in the last two years in terms of training cost. I think the author is making unreasonable extrapolations based on changes in performance on the Dawn benchmarks. A lot of the results are fast but require a lot more compute / search time to find the best parameters and training regimen that lead to those fast convergence times. (Learning rate schedule, batch size, image size schedules, etc.) The point being that once the juice is squeezed out you aren’t going to continue to see training convergence time improvements on the same hardware.

Also, because you cited our GPU benchmarks, I wanted to throw in a mention of our GPU instances, which have some of the lowest training costs on the Stanford DAWN benchmarks discussed in the article.

https://lambdalabs.com/service/gpu-cloud


Another data point:

"For example, we recently internally benchmarked an Inferentia instance (inf1.2xlarge) against a GPU instance with an almost identical spot price (g4dn.xlarge) and found that, when serving the same ResNet50 model on Cortex, the Inferentia instance offered a more than 4x speedup."

https://towardsdatascience.com/why-every-company-will-have-m...


That data point talks about inference, though, and nobody is disputing that deployment and inference have improved significantly over the past few years.

I'm referring to training and fine-tuning, not inference, which - let's be honest - can be done on a phone these days.


I don't really know whether the hardware breakthroughs the article refers to are already reflected in cloud GPU performance, but software advances are reflected nonetheless. So even though pricing has only fluctuated marginally since 2018, it is just plain faster to train a neural network today because of software advances, from what I understood.


But that's not what the actual data says.

Here are some figures from an actual benchmark [1] w.r.t. training costs:

1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)

2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)

3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)

--

Training time didn't go down exponentially either [1]:

1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)

2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)

3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)

So again, I have to ask: where exactly do these magical improvements occur (regarding training - inference is another matter entirely, I understand that)? I've yet to find a source that supports 4x to 10x cost reductions.

[1] https://dawn.cs.stanford.edu/benchmark/index.html


I guess I should have been more skeptical of the article's figures. But still, if we give it the benefit of the doubt, is there any scenario in which we might see the reduction mentioned? $1000 to $10?


The scenario is indeed there - if you take early 2017 numbers and restrict yourself to AWS/Google/Azure and outdated hardware and software, you can get to the US$1000 figure.

Likewise, if your other point of comparison is late 2019 AlibabaCloud spot pricing, you can get to US$10 for the same task.

Realistically, though, that's worst-case 2017 vs. best-case 2019/2020. So sure, you can get to that if you choose your numbers carefully.

They basically compared results from hardware that even in 2017 was two generations behind against the latest hardware. So yeah - between 2015 and 2019 we indeed saw a cost reduction from ~$1000 to ~$10 (on a "major cloud provider then vs. best offer today" scale).

I only take issue with the assumption that the trend continues this way, which it doesn't seem to.


I trained a useful neural network and prototyped a viable [failed] startup technology something like 4 years ago on a 1080ti with a mid range CPU. It was enough to get me meetings with a couple of the largest companies in the world.

Yeah, it took 12-24 hours to do what I could log in to AWS and accomplish in minutes with parallel GPUs... but practical solutions were already in reach. The primary changes now are buzz and a possibly unprecedented rate of research progress.


I would really like a thorough analysis of how expensive it is to multiply large matrices, which is the most expensive part of transformer training, for example, according to the profiler. Is there some Moore's-law-like trend?
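
The back-of-the-envelope model I'm working from is simple (a sketch; the peak-TFLOPS and utilization figures are assumptions, roughly a V100-class GPU in fp16):

    def matmul_flops(m, k, n):
        # Multiplying an (m x k) by a (k x n) matrix costs ~2*m*k*n FLOPs.
        return 2 * m * k * n

    def estimated_seconds(m, k, n, peak_tflops=100, utilization=0.3):
        # Wall clock ~= FLOPs / (peak throughput * achieved fraction).
        return matmul_flops(m, k, n) / (peak_tflops * 1e12 * utilization)

    # e.g. a single 4096 x 4096 x 4096 matmul:
    print(f"~{estimated_seconds(4096, 4096, 4096) * 1e3:.1f} ms")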


It would be regrettable if an equivalent of the self-fulfilling prophecy of Moore's "Law" (originally an astute observation and forecast, but not remotely a law) became a driver/limiter in this field as well, and even more so if it's a straight transplant for soundbite reasons rather than the result of any impartial and thoughtful analysis.


One thing I've wondered is whether Moore's Law is good or bad, in the sense of how fast we should have been able to improve IC technology. Was progress limited by business decisions, or is this as fast as improvements could take place?

A thought experiment: suppose we meet aliens who are remarkably similar to ourselves and have an IC industry. Would they be impressed by our Moore's law progress, or wonder why we took so long?


https://en.wikipedia.org/wiki/Moore%27s_law, third paragraph of the header, claims that Moore's Law drove targets in R&D and manufacturing, but does not cite a reference for this claim.

"Moore's prediction has been used in the semiconductor industry to guide long-term planning and to set targets for research and development."


I'm not sure what the point of that question is. In theory you could have a government subsidize the construction of fabs so that skipping nodes is feasible, but why on earth would you do that when the industry is fully self-sufficient and wildly profitable?


The cost to collect the huge amounts of data needed to train meaningful models is surely not improving at this rate.


Despite Nvidia vaguely prohibiting users from using their desktop cards for machine learning in any sort of data-center-like or server-like capacity. Hopefully AMD's ML support / OpenCL will continue improving.


Last I saw, they don’t even support ROCm on their recent Navi cards, so I’d be hesitant.


Wow. This is really disappointing to see. (https://github.com/RadeonOpenCompute/ROCm/issues/887)

I guess PlaidML might be a viable option?


Does this mean that the cost to train something like GPT-3 by OpenAI will drop from 12 million dollars to something less next year? If so, how much will it drop to?


It was probably very inefficient to begin with.


Indeed nonexistent


"AI" is not really appropriate name for what it is


tl;dr: Training learners is becoming cheaper every year, thanks to big tech companies pushing hardware and software.



