I don't really know if those hardware breakthroughs that the article refers to a...

qayxc · on July 5, 2020

But that's not what the actual data says.

Here's some figures from an actual benchmark [1] w.r.t. training costs:

1. [Mar 2020] $7.43 (AlibabaCloud, 8xV100, TF v2.1)

2. [Sep 2018] $12.60 (Google, 8 TPU cores, TF v1.11)

3. [Mar 2020] $14.42 (AlibabaCloud, 128xV100, TF v2.1)

--

Training time didn't go down exponentially either [1]:

1. [Mar 2020] 0:02:38 (AlibabaCloud, 128 x V100, TF v2.1)

2. [May 2019] 0:02:43 (Huawei Cloud, 128 x V100, TF v1.13)

3. [Dec 2018] 0:09:22 (Huawei Cloud, 128 x V100, MXNet)

So again, I have to ask where exactly do these magical improvement occur (regarding training - inference is another matter entirely, I understand that)? I've yet to find a source that supports 4x to 10x cost reductions.

[1] https://dawn.cs.stanford.edu/benchmark/index.html

gchamonlive · on July 5, 2020

I guess I should have been more skeptical of the articles figures. But still, if we give the benefit of the doubt, is there any scenario we might see the reduction mentioned? 1000 to 10 USD?

qayxc · on July 5, 2020

The scenario is indeed there - if you take early 2017 numbers and restrict yourself to AWS/Google/Azure and outdated hardware and software, you can get to the US$1000 figure.

Likewise, if your other point of comparison is late 2019 AlibabaCloud spot pricing, you can get to US$10 for the same task.

Realistically, though, that's worst case 2017 vs best case 2019/2020. So you sure, you can get to that if you choose your numbers correctly.

They basically compared results from H/W that even in 2017 was 2 generations behind with the latest H/W. So yeah - between 2015 and 2019 we indeed saw a cost reduction from ~1000 to ~10 USD (on the major cloud provider vs best offer today scale).

I only take issue with the assumption that the trend continues this way, which it doesn't seem to.