1000 cores of what? A vCore is marketing BS. Even if it weren't, 1000 cores is roughly 28 2U 3-node boxes (with older CPUs) or 14 of them (with more recent ones). Unless they have an extremely spiky workload, using AWS is pointless. Bandwidth-bound scientific apps ==> use an InfiniBand cluster.
The OP is talking about running $0.027 worth of computation (1000 cores for 10 s at $0.01/core/hr), and you think he should spend tens of thousands on hardware?
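Quick back-of-the-envelope on that figure, taking the $0.01/core/hr rate at face value:

    # rough cloud cost of one run, using the numbers from this thread
    cores = 1000
    seconds = 10
    price_per_core_hour = 0.01               # assumed rate from the comment above
    core_hours = cores * seconds / 3600.0    # ~2.78 core-hours
    print(core_hours * price_per_core_hour)  # ~0.028 USD per run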
I'm not doubting a custom build will give him much greater bandwidth. I just doubt the workload has to be "extremely" spiky to make the cloud cost-effective.
Of course, he's going to get billed for a 10-minute or 1-hour minimum (Google or Amazon, respectively), so that assumes he can amortize the startup cost across multiple jobs.
The big question is: why does it need to run in 10 s? The main reason I can see is wanting to run this analysis very frequently, but then the workload approaches constant.
The total amount of data is 150 GB; that would easily fit into memory on a single powerful 2-socket server with 20 cores and would then run in less than 15 minutes. The hardware required to do that will cost you ~ $6000 from Dell; assuming a system lifetime of five years and assuming (like you do) that you can amortize across multiple jobs, the cost is roughly the same as from the cloud, about $0.036 per analysis.
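For what it's worth, here's roughly how I get to that per-analysis number, assuming the box runs analyses back to back for its whole lifetime (any idle time pushes the figure up):

    # $6000 server amortized over 5 years of back-to-back 15-minute jobs
    server_cost = 6000.0
    jobs = 5 * 365 * 24 * 4    # four 15-minute analyses per hour for 5 years
    print(server_cost / jobs)  # ~0.034 USD per analysis, same ballpark as $0.036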
I'm fairly certain that, in the end, it's not more expensive for the customer to just buy a server to run the analysis on.
Edit: I see OP says 80% of the time is spent reading data into memory, at about 100 MB/s. Add $500 worth of SSD to the example server I outlined, and we can cut the application runtime by >70%, making the dedicated hardware significantly cheaper.
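Sketch of where a >70% cut could come from, assuming the SSD reads at around 1 GB/s (the exact drive speed is my assumption, not the OP's):

    # Amdahl-style estimate: only the I/O portion of the runtime speeds up
    io_fraction = 0.80   # OP: 80% of runtime is reading data at 100 MB/s
    io_speedup = 10.0    # assumed: ~1 GB/s SSD vs. the 100 MB/s source
    remaining = (1 - io_fraction) + io_fraction / io_speedup
    print(1 - remaining) # ~0.72, i.e. a ~72% reduction in runtime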
A vCore is a hyperthread of an unknown CPU, so in reality 1000 vCores is 500 real cores. With all the overheads it's more like 450. Given the low utilization while the dataset loads, to keep it at 10 sec you would need ~90 real cores, or 4 x 2U 3-node dual-socket boxes (eBay, ~$1.5K each) plus 2 InfiniBand switches (eBay, ~$300 each). For $6,600 you have a dedicated solution with no latency bubbles at a fixed, low cost.
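The hardware bill adds up as follows (eBay prices estimated as above):

    # one-time cost of the used-hardware cluster described above
    compute = 4 * 1500        # four 2U 3-node dual-socket boxes, ~$1.5K each
    network = 2 * 300         # two used InfiniBand switches, ~$300 each
    print(compute + network)  # 6600 USD total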
Briefly... We have many data sets, and the <10 sec calculations happen every few seconds for every data set in active use. Caching results is rarely helpful in our case because the number of possible results is immense. The back end drives an interactive/real-time experience for the user, so we need the speed. Our load is somewhat spiky: overnight in US time zones we're very quiet, and during the daytime we can use more than 1k vCPUs.
We've considered a few kinds of platforms (AWS spot fleet/GCE autoscaled preemptible VMs, AWS Lambda, bare metal hosting, even Beowulf clusters), and while bare metal has its benefits as you've pointed out, at our current stage it doesn't make sense for us financially.
I omitted from the blog post that we don't rely exclusively on object storage services, because their performance is relatively low. We cache files on the compute nodes, so much of the time we avoid that "80% of time is spent reading data".
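The pattern is roughly the following (not our actual code; the cache path and the fetch hook are placeholders):

    import os, shutil

    CACHE_DIR = "/var/cache/datasets"   # hypothetical local path on a compute node

    def local_copy(key, fetch_from_object_store):
        """Return a local path for `key`, hitting object storage only on a cache miss."""
        path = os.path.join(CACHE_DIR, key)
        if not os.path.exists(path):              # miss: pay the slow read once
            os.makedirs(os.path.dirname(path), exist_ok=True)
            tmp = path + ".tmp"
            fetch_from_object_store(key, tmp)     # caller-supplied S3/GCS download
            shutil.move(tmp, path)                # publish only after a full download
        return path                               # later jobs read from local disk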
(Re: Netflix, in qaq's other comment, I don't have a hard number for this, but I thought a typical AWS data center is only under 20-30% load at any given time.)
They have a single client running a single 10 sec job in a day?
They plan to continue having a single client run a single 10-second job a day? The workload does have to be spiky to make the cloud cost-effective. There are workloads that are not appropriate for AWS. For any serious client, AWS is a bad idea simply because there is a single tenant (Netflix) consuming such a high percentage of resources that if they make a mistake causing a 40-50% increase in their load, everyone gets f#$%ed.
You're hypothesising about something that has never happened. Check out some 3rd party cloud uptime metrics - the major providers (AWS, Google, Azure) have had less than an hour of downtime in the past year. Reliability is no longer on the agenda - it has been proven.
It did happen, and my clients were affected. After AWS fu$%ed up the rollout of a software update in 2011, which overwhelmed their control plane, took whole zones down for many hours, and took many days to fully recover from, they rolled out patches that throttle cross-zone migration. After those patches, at one point Netflix was having issues and started a massive migration that hit the throttle thresholds and affected other tenants' ability to move to unaffected zones. It's very far from hypothetical: given that Netflix consumes about 30% of resources (which translates to many entire physical datacenters), a 50% spike from them would overwhelm the spare capacity.
I spun up something like 200 "cores" to archive a large Cassandra cluster to Google Storage (a Kubernetes cluster plus 200+ containers running the archive worker). I could have gone much bigger to get it done faster, but it wasn't necessary. ETL or archive jobs would be the most common case, to answer your question.