CERN has released 300 terabytes of research data from LHC (symmetrymagazine.org)
153 points by elorant on April 22, 2016 | 37 comments



Tangentially, there are even larger public datasets coming out of astronomy. PanSTARRS' imaging survey will make 2 PB available online. They even have a picture [0] of the completed dataset in transit, in case you've wondered what 2 PB of HDDs on a flatbed looks like.

[0] https://archive.stsci.edu/mug/mug_2016/PS1_MUG_2016jan14.pdf...


'Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.'


2PB (5,000 lbs) / 4 days = 46 Gbps (excluding setup time)

Not bad, but not extraordinary
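The figure above can be reproduced with a quick back-of-the-envelope calculation (a sketch, assuming 2 PB = 2e15 bytes and a 4-day drive with no loading time):

```python
# Effective bandwidth of 2 PB of disks on a truck over a 4-day trip.
payload_bits = 2e15 * 8           # 2 PB expressed in bits
trip_seconds = 4 * 24 * 3600      # 4 days of driving
gbps = payload_bits / trip_seconds / 1e9
print(f"{gbps:.1f} Gbps")         # -> 46.3 Gbps
```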


You should consider the impact of distance here as well...


This is from Andrew Tanenbaum AFAIR


(2 petabytes)/(1 gigabit/second) = 185.2 days

Good luck downloading that :)


Many companies and universities are connected to the internet backbone with 40 Gbps. No luck needed.
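Putting numbers on both comments (a sketch, assuming 2 PB = 2e15 bytes and a fully saturated link):

```python
# Days needed to move a dataset at a given sustained link speed.
def transfer_days(petabytes: float, gbps: float) -> float:
    bits = petabytes * 1e15 * 8
    return bits / (gbps * 1e9) / 86400

print(f"{transfer_days(2, 1):.1f} days at 1 Gbps")    # -> 185.2 days
print(f"{transfer_days(2, 40):.1f} days at 40 Gbps")  # -> 4.6 days
```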


We're a CERN connected site and originally 'only' had 10x10G feeds. When the first data started to come in the network guy was looking a bit worried and said "the plane's coming into land, and the runway isn't nearly long enough..."


It has been pushed back to later this year.


Ah, thanks for the update.


>“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” says Kati Lassila-Perini, a CMS physicist who leads these data preservation efforts.

Why wait until they've exhausted their efforts?


They might want first crack at any major discoveries; if they miss something, then everyone else gets a crack.

Seems reasonable to me.


Politics and fame over understanding and progress?


If you don't like it, pay for it.


Who paid for this research anyway? Taxpayers or private organizations?


CERN is funded by its member states, i.e. taxpayers:

https://en.wikipedia.org/wiki/CERN#Participation_and_funding


or politics and fame... and also understanding and progress. we want it all mofos!


1. It's only fair that the collaborations get first dibs on producing results out of their blood, sweat and tears.

2. The collaborations don't want to waste time shooting down the large number of false claims that would inevitably happen if the data were made public immediately.


It's cool seeing technology developed at CERN in the spotlight. There are a lot of interesting tools developed there that can solve real problems outside CERN and academia.

One such technology, featured in the article, is the CernVM File System, which is used to distribute terabytes of scientific software to hundreds of datacenters all over the world.

A shameless plug:

Apache Mesos recently integrated it to solve the container image distribution problem (https://mesosphere.com/blog/2016/03/08/cernvmfs-mesos-contai...).


A shameless plug:

Given that cheap and disposable trainees — PhD students and postdocs — fuel the entire scientific research enterprise, it is not surprising that few inside the system seem interested in change. A system complicit in this sort of exploitation is at best indifferent and at worst cruel.

http://www.nature.com/news/2011/110302/full/471007a.html

Potential missing staff in some areas is a separate issue, and educational programmes are not designed to make up for it. On-the-job learning and training are not separated but dynamically linked together, benefiting both parties. In my three years of operation, I have unfortunately witnessed cases where CERN duties and educational training became contradictory and even conflicting.

http://ombuds.web.cern.ch/blog/2013/06/lets-not-confuse-stud...

Resolution of the Staff Council

- the Management does not propose to align the level of basic CERN salaries with those chosen as the basis for comparison;

- in the new career system a large fraction of the staff will have their advancement prospects, and consequently the level of their pension, reduced with respect to the current MARS system;

- the overall reduction of the advancement budget will have a negative impact on the contributions to the CERN Health Insurance System (CHIS);

http://cds.cern.ch/journal/CERNBulletin/2015/46/Staff%20Asso...

And a warning to non-western members:

"The cost [...] has been evaluated, taking into account realistic labor prices in different countries. The total cost is X (with a western equivalent value of Y) [where Y>X]

source: LHCb calorimeters : Technical Design Report

ISBN: 9290831693 cdsweb.cern.ch/record/494264

A shameless plug:

The Dangers of Self-Reference

Public relations pioneer Edward Bernays refined the creation and use of press releases.

Propaganda was used by the United States, the United Kingdom, Germany and others to rally for domestic support and demonize enemies during the World Wars, which led to more sophisticated commercial publicity efforts as public relations talent entered the private sector. Most historians believe public relations became established first in the US by Ivy Lee or Edward Bernays (he felt this manipulation was necessary in society), then spread internationally. Many American companies with PR departments spread the practice to Europe when they created European subsidiaries as a result of the Marshall plan.


If only I could run to Fry's and buy a 300TB hard drive.


Well, you can "run out" and buy a 180TB 4U Backblaze storage pod assembled for about $10,500. For $21,000 you can buy two and have 60TB to spare. $8,500/$17,000 if you want to DIY. Not too bad:

https://www.backblaze.com/blog/cloud-storage-hardware/
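The arithmetic behind those figures, as a rough sketch (assumes the quoted assembled-pod price; actual drive prices vary):

```python
# Two 180 TB pods vs. the 300 TB dataset, at the quoted assembled price.
pod_tb, pod_cost = 180, 10_500
pods = 2
total_tb = pods * pod_tb               # 360 TB of raw capacity
spare_tb = total_tb - 300              # 60 TB left over
total_cost = pods * pod_cost           # $21,000
print(total_tb, spare_tb, total_cost, round(total_cost / total_tb, 2))
# -> 360 60 21000 58.33  (about $58/TB)
```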


Yev from Backblaze here -> http://www.backuppods.com/ Check those guys out; they'll build one for you, or you can DIY if you're handy. And then you can choose which drives you want to toss in it.


Whoah! There's been a lot of improvements since the initial revision.


More to come...soon ;-)



The LHC experiments should be sensitive to a wide range of factors. I wonder if randomly correlating every variation of the results from the same conditions could reveal some unexpected correlations, e.g. between particle path variation and earthquakes (just speculating here, not putting a theory forward).


... on wikileaks!


In keeping with this spirit, here is a reminder of how we monitor (your) CERN activities. We monitor all network traffic coming into and going out of CERN.

Our new analysis infrastructure will be able to cope with the automatic live analysis of about one terabyte of data every day. All this data is stored for one year.

Transparent monitoring for your protection


This [1] is apparently the data released. I am no physicist, but that page doesn't exactly inspire awe among the curious-minded.

They do explain in the original release article how a couple of undergrads were able to use the data to create something meaningful, but that specific site could definitely use a UX designer, or two.

[1] http://opendata.cern.ch/search?ln=en&p=Run2011A+AND+collecti...


Indeed, I'll use the data from the other LHC with the pretty site instead.


I don't mean it has to be pretty, but that is not even pleasant to look at. I can provide all the useful data in the world, but if its accessibility is low then its value is greatly reduced.


I doubt it makes that much difference in reality; the value is in the data, and since this data is unique and from a single source I can't see it mattering.

Not arguing against the value of accessibility, but in this case it's a nice-to-have rather than an essential.


The site is clearly designed for people who work in the field, and even then it only took me a moment to find a download link for some data. It even has a workable search function.

I'm not saying it couldn't be better.


I don't think they need to care about UX. The "conversion rate" is probably absurdly low given the need for storage, RAM and CPUs to store and process the data...


http://home.cern/about/computing

"Physicists must sift through the 30 petabytes or so of data produced annually to determine if the collisions have thrown up any interesting physics.

The Data Centre processes about one petabyte of data every day - the equivalent of around 210,000 DVDs. The centre hosts 11,000 servers with 100,000 processor cores. Some 6000 changes in the database are performed every second. The Grid runs more than two million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second."
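The scale in that quote can be sanity-checked (a sketch, assuming 4.7 GB single-layer DVDs and decimal units):

```python
# One petabyte per day expressed as DVDs and as sustained bandwidth.
pb_per_day = 1e15                       # bytes processed daily
dvds = pb_per_day / 4.7e9               # 4.7 GB per single-layer DVD
gbps = pb_per_day * 8 / 86400 / 1e9     # average sustained rate
print(f"{dvds:,.0f} DVDs, {gbps:.1f} Gbps average")
# -> 212,766 DVDs, 92.6 Gbps average  (consistent with "around 210,000")
```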

Yeah, I don't think it's feasible to release data that can be used to do deep physics.


The logical thing to do is replicate the dataset to the 3 major cloud providers (Amazon, Google and Microsoft) so that anybody can attach their VMs to it with local data access speeds.



