"At the UH Cancer Center we routinely deal with datasets in the TB - PB range ..."
...
"Do you happen to have an S3 bucket with that data live?"
As someone not working in academia (or in this field at all), can you help me understand the question you have just asked?
Specifically, wouldn't it be tremendously profligate for them to have that PB-range dataset living in S3?
Given the resources that a university has (Internet2 connectivity, hardware budget, and (relatively) cheap manpower), why would they ever store that data outside of their own UH datacenter?
If the answer is "offsite backup", wouldn't it be Glacier or Nearline or ... anything but S3?
Good questions. First, services like open.quiltdata.com and Amazon's Registry of Open Data cover the S3 costs for public data, so that's one incentive. Second, the costs of cloud resources are highly competitive with (if not superior to) on-premises data centers (see https://twitter.com/mohapatrahemant/status/11024016152632238...). I don't think it's correct to think of S3 as expensive.
There are many ways to shave S3 costs (e.g. Intelligent-Tiering, Glacier), but at some point the data become so slow to access that you can't offer a pleasant user experience around browsing, searching, and feeding pipelines.
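To give a sense of how simple that tiering knob is: it's just a lifecycle rule on the bucket. Here's a rough boto3 sketch (the bucket name and prefix are placeholders, not real resources):

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/prefix for illustration: transition older raw objects
    # to cheaper storage classes while recent data stays in Standard.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-genomics-data",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-down-raw-data",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )

The catch is exactly the one above: once objects land in the archive tiers, listing is still fast but reads need a restore first, which kills interactive browsing.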
Most importantly, the "my data, my bucket" strategy gives users control over their data. A university with its own bucket has more control over its data than it does if Google, Facebook, etc. host and monetize it.
> If the answer is "offsite backup", wouldn't it be Glacier or Nearline or ... anything but S3?
Well, technically, S3 Glacier and S3 Glacier Deep Archive are still S3, and Cloud Storage Nearline is similar, except it's a tier on Google's S3-equivalent service.
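You can see that in the API itself: archiving and restoring go through the same S3 calls, only with a different storage class. A rough boto3 sketch, with made-up bucket and key names:

    import boto3

    s3 = boto3.client("s3")

    # Same PutObject call as regular S3 -- only the storage class differs.
    # Bucket and key are placeholders, not real resources.
    with open("exome-batch-042.tar", "rb") as f:
        s3.put_object(
            Bucket="example-archive-bucket",
            Key="backups/2019/exome-batch-042.tar",
            Body=f,
            StorageClass="DEEP_ARCHIVE",  # or "GLACIER"; default is "STANDARD"
        )

    # Retrieval is also an S3 call: request a temporary restore before GetObject.
    s3.restore_object(
        Bucket="example-archive-bucket",
        Key="backups/2019/exome-batch-042.tar",
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )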
But because of their mission, lots of public charities, especially academic institutions, host data in a way that is conveniently accessible to the public via well-known APIs, including S3, even when that is not the least expensive option viewed strictly in terms of storage and institution-internal access costs.
+1 for everything Aneesh said, but I also wanted to add that the public cloud offers opportunities in data sharing that academia hasn't yet provided, specifically the ability for collaborators to bring their code to the data. I posted a quote from Jed Sundwall, Global Open Data Lead at AWS, in another thread. I think he really nails it when he says that
the cloud "completely changes the dynamic for sharing data."
There certainly have been efforts in academia to provide shared computing resources. Cyverse (https://www.cyverse.org/about) comes to mind. At Wisconsin many researchers shared clusters using Condor. But none, to my knowledge, come close to the scale, reliability, and features of AWS and the other major cloud providers.