"At the UH Cancer Center we routinely deal with datasets in the TB - PB range ..."
...
"Do you happen to have an S3 bucket with that data live?"
As someone not working in academia (or in this field at all), can you help me understand the question you have just asked?
Specifically, wouldn't it be tremendously profligate for them to have that PB-range dataset living in S3?
Given the resources that a university has (Internet2 connectivity, hardware budget, and (relatively) cheap manpower), why would they ever store that data outside of their own UH datacenter?
If the answer is "offsite backup", wouldn't it be Glacier or Nearline or ... anything but S3?
Good questions. First, services like open.quiltdata.com and Amazon's Registry of Open Data cover the S3 costs for public data, so that's one incentive. Second, the costs of cloud resources are highly competitive with (if not superior to) on-premises data centers (see https://twitter.com/mohapatrahemant/status/11024016152632238...). I don't think it's correct to think of S3 as expensive.
There are many ways to shave S3 costs (e.g. Intelligent-Tiering, Glacier), but at some point the data become so slow to access that you can't offer a pleasant user experience around browsing, searching, and feeding pipelines.
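To give a sense of how simple that tiering knob is: it's just a lifecycle rule on the bucket. Here's a rough boto3 sketch (the bucket name and prefix are placeholders, not real resources):

    import boto3

    s3 = boto3.client("s3")

    # Placeholder bucket/prefix for illustration: transition older raw objects
    # to cheaper storage classes while recent data stays in Standard.
    s3.put_bucket_lifecycle_configuration(
        Bucket="example-genomics-data",
        LifecycleConfiguration={
            "Rules": [
                {
                    "ID": "tier-down-raw-data",
                    "Filter": {"Prefix": "raw/"},
                    "Status": "Enabled",
                    "Transitions": [
                        {"Days": 30, "StorageClass": "INTELLIGENT_TIERING"},
                        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                    ],
                }
            ]
        },
    )

The catch is exactly the one above: once objects land in the archive tiers, listing is still fast but reads need a restore first, which kills interactive browsing.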
Most importantly, the "my data, my bucket" strategy gives users control over their data. A university with its own bucket has more control over its data than it does if Google, Facebook, etc. host and monetize it.
> If the answer is "offsite backup", wouldn't it be Glacier or Nearline or ... anything but S3?
Well, technically, S3 Glacier and S3 Glacier Deep Archive are still S3, and Cloud Storage Nearline is similar, except it's a tier on Google's S3-equivalent service.
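You can see that in the API itself: archiving and restoring go through the same S3 calls, only with a different storage class. A rough boto3 sketch, with made-up bucket and key names:

    import boto3

    s3 = boto3.client("s3")

    # Same PutObject call as regular S3 -- only the storage class differs.
    # Bucket and key are placeholders, not real resources.
    with open("exome-batch-042.tar", "rb") as f:
        s3.put_object(
            Bucket="example-archive-bucket",
            Key="backups/2019/exome-batch-042.tar",
            Body=f,
            StorageClass="DEEP_ARCHIVE",  # or "GLACIER"; default is "STANDARD"
        )

    # Retrieval is also an S3 call: request a temporary restore before GetObject.
    s3.restore_object(
        Bucket="example-archive-bucket",
        Key="backups/2019/exome-batch-042.tar",
        RestoreRequest={"Days": 7, "GlacierJobParameters": {"Tier": "Bulk"}},
    )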
But because of their mission, lots of public charities, especially academic institutions, host data in a way that is conveniently accessible to the public via well-known APIs, including S3, even when that is not the least expensive option viewed strictly in terms of storage and institution-internal access costs.
+1 for everything Aneesh said, but I also wanted to add that the public cloud offers opportunities in data sharing that academia hasn't yet provided, specifically the ability for collaborators to bring their code to the data. I posted a quote from Jed Sundwall, Global Open Data Lead at AWS, in another thread. I think he really nails it when he says that
the cloud "completely changes the dynamic for sharing data."
There certainly have been efforts in academia to provide shared computing resources. Cyverse (https://www.cyverse.org/about) comes to mind. At Wisconsin many researchers shared clusters using Condor. But none, to my knowledge, come close to the scale, reliability, and features of AWS and the other major cloud providers.