CERN has released 300 terabytes of research data from LHC (symmetrymagazine.org)
153 points by elorant on April 22, 2016 | 37 comments



Tangentially, there are even larger public datasets coming out of astronomy. PanSTARRS' imaging survey will make 2 PB available online. They even have a picture [0] of the completed dataset in transit, in case you've wondered what 2 PB of HDDs on a flatbed looks like.

[0] https://archive.stsci.edu/mug/mug_2016/PS1_MUG_2016jan14.pdf...


'Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.'


2PB (5,000 lbs) / 4 days = 46 Gbps (excluding setup time)

Not bad, but not extraordinary
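The figure above can be reproduced with a quick back-of-the-envelope calculation (a sketch, assuming 2 PB = 2e15 bytes and a 4-day drive with no loading time):

```python
# Effective bandwidth of 2 PB of disks on a truck over a 4-day trip.
payload_bits = 2e15 * 8           # 2 PB expressed in bits
trip_seconds = 4 * 24 * 3600      # 4 days of driving
gbps = payload_bits / trip_seconds / 1e9
print(f"{gbps:.1f} Gbps")         # -> 46.3 Gbps
```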


You should consider the impact of distance here as well...


This is from Andrew Tanenbaum AFAIR


(2 petabytes)/(1 gigabit/second) = 185.2 days

Good luck downloading that :)


Many companies and universities are connected to the internet backbone with 40 Gbps. No luck needed.
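Putting numbers on both comments (a sketch, assuming 2 PB = 2e15 bytes and a fully saturated link):

```python
# Days needed to move a dataset at a given sustained link speed.
def transfer_days(petabytes: float, gbps: float) -> float:
    bits = petabytes * 1e15 * 8
    return bits / (gbps * 1e9) / 86400

print(f"{transfer_days(2, 1):.1f} days at 1 Gbps")    # -> 185.2 days
print(f"{transfer_days(2, 40):.1f} days at 40 Gbps")  # -> 4.6 days
```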


We're a CERN connected site and originally 'only' had 10x10G feeds. When the first data started to come in the network guy was looking a bit worried and said "the plane's coming into land, and the runway isn't nearly long enough..."


It has been pushed back to later this year.


Ah, thanks for the update.


>“Once we’ve exhausted our exploration of the data, we see no reason not to make them available publicly,” says Kati Lassila-Perini, a CMS physicist who leads these data preservation efforts.

Why wait until they've exhausted their efforts?


They might want first crack at any major discoveries; if they miss something, then everyone else gets a crack.

Seems reasonable to me.


Politics and fame over understanding and progress?


If you don't like it, pay for it.


Who paid for this research anyway? Taxpayers or private organizations?


CERN is funded by its member states, i.e. taxpayers:

https://en.wikipedia.org/wiki/CERN#Participation_and_funding


or politics and fame... and also understanding and progress. we want it all mofos!


1. It's only fair that the collaborations get first dibs on producing results out of their blood, sweat and tears.

2. The collaborations don't want to waste time shooting down the large number of false claims that would inevitably happen if the data were made public immediately.


It's cool seeing technology developed at CERN in the spotlight. There are a lot of interesting tools developed there that can solve real problems outside CERN and academia.

One such technology, featured in the article, is the CernVM File System, which is used to distribute terabytes of scientific software to hundreds of datacenters all over the world.

A shameless plug:

Apache Mesos recently integrated it to solve the container image distribution problem (https://mesosphere.com/blog/2016/03/08/cernvmfs-mesos-contai...).


A shameless plug:

Given that cheap and disposable trainees — PhD students and postdocs — fuel the entire scientific research enterprise, it is not surprising that few inside the system seem interested in change. A system complicit in this sort of exploitation is at best indifferent and at worst cruel.

http://www.nature.com/news/2011/110302/full/471007a.html

Potential missing staff in some areas is a separate issue, and educational programmes are not designed to make up for it. On-the-job learning and training are not separated but dynamically linked together, benefiting both parties. In my three years of operation, I have unfortunately witnessed cases where CERN duties and educational training became contradictory and even conflicting.

http://ombuds.web.cern.ch/blog/2013/06/lets-not-confuse-stud...

Resolution of the Staff Council

- the Management does not propose to align the level of basic CERN salaries with those chosen as the basis for comparison;

- in the new career system a large fraction of the staff will have their advancement prospects, and consequently the level of their pension, reduced with respect to the current MARS system;

- the overall reduction of the advancement budget will have a negative impact on the contributions to the CERN Health Insurance System (CHIS);

http://cds.cern.ch/journal/CERNBulletin/2015/46/Staff%20Asso...

And a warning to non-western members:

"The cost [...] has been evaluated, taking into account realistic labor prices in different countries. The total cost is X (with a western equivalent value of Y) [where Y>X]

source: LHCb calorimeters : Technical Design Report

ISBN: 9290831693 cdsweb.cern.ch/record/494264

A shameless plug:

The Dangers of Self-Reference

Public relations pioneer Edward Bernays refined the creation and use of press releases.

Propaganda was used by the United States, the United Kingdom, Germany and others to rally for domestic support and demonize enemies during the World Wars, which led to more sophisticated commercial publicity efforts as public relations talent entered the private sector. Most historians believe public relations became established first in the US by Ivy Lee or Edward Bernays (he felt this manipulation was necessary in society), then spread internationally. Many American companies with PR departments spread the practice to Europe when they created European subsidiaries as a result of the Marshall plan.


If only I could run to Fry's and buy a 300TB hard drive.


Well, you can "run out" and buy a 180TB 4U Backblaze storage pod assembled for about $10,500. For $21,000 you can buy two and have 60TB to spare. $8,500/$17,000 if you want to DIY. Not too bad:

https://www.backblaze.com/blog/cloud-storage-hardware/
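The arithmetic behind those figures, as a rough sketch (assumes the quoted assembled-pod price; actual drive prices vary):

```python
# Two 180 TB pods vs. the 300 TB dataset, at the quoted assembled price.
pod_tb, pod_cost = 180, 10_500
pods = 2
total_tb = pods * pod_tb               # 360 TB of raw capacity
spare_tb = total_tb - 300              # 60 TB left over
total_cost = pods * pod_cost           # $21,000
print(total_tb, spare_tb, total_cost, round(total_cost / total_tb, 2))
# -> 360 60 21000 58.33  (about $58/TB)
```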


Yev from Backblaze here -> http://www.backuppods.com/ Check those guys out; they'll build one for you, or you can DIY if you're handy. And then you can choose which drives you want to toss in it.


Whoah! There's been a lot of improvements since the initial revision.


More to come...soon ;-)



The LHC experiments should be sensitive to a wide range of factors. I wonder if randomly correlating every variation of the results from the same conditions could reveal some unexpected correlations, e.g. between particle path variation and earthquakes (just speculating here, not putting a theory forward).


... on wikileaks!


In keeping with this spirit, here is a reminder of how we monitor (your) CERN activities. We monitor all network traffic coming into and going out of CERN.

Our new analysis infrastructure will be able to cope with the automatic live analysis of about one terabyte of data every day. All this data is stored for one year.

Transparent monitoring for your protection


This [1] is apparently the data released. I am no physicist, but that page doesn't exactly inspire awe among the curious-minded.

They do explain in the original release article how a couple of undergrads were able to use the data to create something meaningful, but that specific site could definitely use a UX designer, or two.

[1] http://opendata.cern.ch/search?ln=en&p=Run2011A+AND+collecti...


Indeed, I'll use the data from the other LHC with the pretty site instead.


I don't mean it has to be pretty, but that is not even pleasant to look at. I can provide all the useful data in the world, but if its accessibility is low then its value is greatly reduced.


I doubt it makes that much difference in reality; the value is in the data, and since this data is unique and from a single source I can't see it mattering.

Not arguing against the value of accessibility, but in this case it's a nice-to-have rather than an essential.


The site is clearly designed for people who work in the field, and even then it only took me a moment to find a download link for some data. It even has a workable search function.

I'm not saying it couldn't be better.


I don't think they need to care about UX. The "conversion rate" is probably absurdly low given the need for storage, RAM and CPUs to store and process the data...


http://home.cern/about/computing

"Physicists must sift through the 30 petabytes or so of data produced annually to determine if the collisions have thrown up any interesting physics.

The Data Centre processes about one petabyte of data every day - the equivalent of around 210,000 DVDs. The centre hosts 11,000 servers with 100,000 processor cores. Some 6000 changes in the database are performed every second. The Grid runs more than two million jobs per day. At peak rates, 10 gigabytes of data may be transferred from its servers every second."
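The scale in that quote can be sanity-checked (a sketch, assuming 4.7 GB single-layer DVDs and decimal units):

```python
# One petabyte per day expressed as DVDs and as sustained bandwidth.
pb_per_day = 1e15                       # bytes processed daily
dvds = pb_per_day / 4.7e9               # 4.7 GB per single-layer DVD
gbps = pb_per_day * 8 / 86400 / 1e9     # average sustained rate
print(f"{dvds:,.0f} DVDs, {gbps:.1f} Gbps average")
# -> 212,766 DVDs, 92.6 Gbps average  (consistent with "around 210,000")
```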

Yeah, I don't think it's feasible to release data that can be used to do deep physics.


The logical thing to do is replicate the dataset to the 3 major cloud providers (Amazon, Google and Microsoft) so that anybody can attach their VMs to it with local data access speeds.



