Sigh. This is a guy who is clearly passionate about his hobby and was excited to share it with us. Leave it to HN to shit all over it.
Some people fly to Africa and shoot elephants for fun - as far as hobbies go, this is fairly benign. If you care about the environment, I mean actually care, then you should recognize that policing how many computers someone has in their closet is counterproductive. I don't buy the "every little bit helps" argument because there is always an opportunity cost. Attention spans are limited, and focusing on this non-issue does nothing but draw attention away from things that actually matter, like meat consumption and fossil fuels.
Just to clarify: ArXiv is a pre-print server. By definition anything uploaded there has not (yet) undergone peer review. Posting papers there before they're published in a journal (which can take many months) is standard practice, not "bypassing publishing practices" or a "leak" per se. In this case there seems to be something fishy going on, but the fact that the papers were made public in this way is not strange in itself.
I also hear that it's not uncommon for superconductivity groups to intentionally include "errors" in their early drafts to avoid competitors copying their work before they're ready to publish.
Root is absolutely, mind-blowingly amazing. It gets a bad rap because it forces you to use primitives that were designed back in the early nineties. If you're "just" trying to analyze some data, your experience will indeed be "horrible" compared to what's offered by Python, R, Matlab, or Julia. But beyond that...

Root adds fully working reflection to C++. Root gives you dynamic library loading and reloading - you can fix a bug or add a new feature, recompile parts of your program and keep working without restarting it.

Root has a feature-complete C++ interpreter, with scripting and a REPL, so you can work with it completely interactively. After prototyping you can save your code as a script, and once you've identified the performance-critical parts, you can compile them and get the full power of bare-metal C++ without changing anything about the code. Yes, this is technically possible with e.g. Python + numba as well, but not as straightforward.

Root is fully interoperable with Python and R - you can mix scripts and REPLs between the languages and pass objects between them.

Root can serialize any object, without requiring any custom code whatsoever (some serious dark magic is needed for this). In fact, you can pause your entire program and save it to disk, or send it over the network to keep running somewhere else. Root has its own file format for efficiently storing massive amounts of data in arbitrarily complex structures. It can stream it over the network too, with probabilistic read-ahead and caching for maximum efficiency.

Root comes with libraries for physics/math/stats that rival those of the largest commercial and open-source offerings.

Each one of these is a massive technical achievement, and Root has had most of them for decades now. Oh, and it has largely maintained backwards compatibility through all this time as well.
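To give a flavour of that prototype-then-compile workflow, here's a minimal sketch (the file, function, and histogram names are made up for illustration, not from any real analysis):

    // sketch.C - a toy Root macro: fill a histogram and serialize it to a file.
    #include "TFile.h"
    #include "TH1D.h"
    #include "TRandom3.h"

    void sketch() {
        TRandom3 rng(0);
        TH1D h("h", "Gaussian toy;x;entries", 100, -5, 5);
        for (int i = 0; i < 100000; ++i)
            h.Fill(rng.Gaus(0, 1));

        // Root's I/O writes the object to a .root file with no hand-written
        // serialization code - the dictionary/reflection machinery handles it.
        TFile out("toy.root", "RECREATE");
        h.Write();
        out.Close();
    }

From the Root prompt, ".x sketch.C" runs it through the interpreter, while ".x sketch.C+" has ACLiC compile it into a shared library and run the native version - same code, no changes.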
Of course, very few people outside of CERN need all of this. Even within CERN, many projects don't. But for those who do, there are very few - if any - alternatives.
Python can do maybe 1 percent of all that. (Hell, Python has real trouble not shitting itself and dying after a "pip install", you can definitely forget about seamless native code compilation.)
But do you really need these features, already available in Matlab/Python/R/Julia/Lisp/...? Or did the C++ folks simply refuse to learn other languages?
From what I have seen in R and Python, the main reason for speed issues is incompetent programmers. Certainly, bad C++ code is much faster than bad Python code, but there is also the effort to build/maintain/document/teach Root to noobs.
Hot take: It's really about preferences, not features.
Let me paint you a picture: You have data coming off the detectors at a rate of a couple of hundred GB/s (after pre-filters implemented in FPGAs etc.) that needs to be processed and filtered in real time, with output written to disk and tape at about 1 GB/s. We're talking really CPU-intensive processing here: Kalman filters, clustering algorithms, evaluating machine learning models. The facility is one of a kind and operating costs are in the billions per year, so downtime is unthinkable; this stuff needs to work.

Offline, you're running very, very detailed (and CPU-heavy) simulations. All in all, you have some hundreds of petabytes of data that are constantly being processed and reprocessed for hundreds of different purposes. These systems have many millions of lines of code between them, a lot of which needs to be shared between them - offline analysis needs to re-run online algorithms and so on - so you need a single stack for all systems.

You have some hundreds of thousands of CPU cores to run all of this. Due to how academia works, beyond a couple of large core datacenters, resources are mostly spread out across hundreds of locations globally, so that each participating university can maintain a cluster on its premises for teaching/research/funding reasons. You need an efficient way to get the data that a program needs to where it is running, or preferably to move the program to where the data is. This is not a tech company; there's no revenue, so throwing money at the problem is not an option - it's all funded by taxpayers, so efficiency is paramount.

What language do you reach for? Matlab? Lol. The closest analogy I can think of are some big trading systems and large-scale ML inference and content serving at FAANG and the like. That's all usually Java or C++.
Oh, one more thing: there are very few professional developers dedicated to this. A lot of it is built and maintained by grad students and researchers in between writing papers. They're smart people, and they can code, but they have neither the time nor the interest to learn a new language or framework every other year. They move around. A lot. It wouldn't work to have different tech stacks for different projects - you need to pick one solution, not just for one area but for the entire field, so people can spend less time learning and more time doing. There's no one available to migrate legacy code because some cool new language appeared or because yesterday's cool library isn't maintained anymore. These projects run for decades. Whatever tech you pick, you must be certain that it will still be around and supported 10, 20, 30 years later - that the code still runs and that the data you paid billions for can still be read.
Thanks for the detailed answer, I really appreciate the insight. I work in research myself, so I'm familiar with the general constraints.
I was certainly unaware of the size of the data coming from the detectors. If speed is the argument that beats all others, I'll concede the point. From what I read on the root.cern website, root is a data analysis and simulation environment, so I was not aware of the prototyping-for-online-use aspect.
Because I've spent a lot of time thinking about how software development can work in an analysis-heavy research environment, I'd still like to comment on some of your points. For distributing binaries and source code, packages work very well for us. Especially if you want to reuse software components in unseen contexts, packages and a package registry make the most sense.
The use case of "re-running online algorithms in offline analysis" is a very familiar one. In my line of work we do that daily: switching between online and offline to test and deploy algorithms. Vastly smaller scale, of course. But to us, packages are the first part of the solution. All you do is change the data source: for offline it's local data or a remote DB; for online it's an interface such as a websocket.
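To make that concrete, here's a rough sketch of the idea (hypothetical names, in C++ to match the rest of the thread; a real online backend would obviously be more involved):

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // One interface, two backends: offline (local data) and online (live feed).
    struct DataSource {
        virtual ~DataSource() = default;
        virtual std::vector<double> nextBatch() = 0;
    };

    // Offline backend: read samples from a local dump (or a remote DB).
    struct LocalFileSource : DataSource {
        explicit LocalFileSource(const std::string& path) : in(path) {}
        std::vector<double> nextBatch() override {
            std::vector<double> batch;
            double x;
            for (int i = 0; i < 1024 && (in >> x); ++i) batch.push_back(x);
            return batch;
        }
        std::ifstream in;
    };

    // The algorithm code only ever sees the interface.
    void runAnalysis(DataSource& src) {
        auto batch = src.nextBatch();
        std::cout << "processed " << batch.size() << " samples\n";
    }

    int main() {
        LocalFileSource offline("samples.txt");  // hypothetical offline dump
        runAnalysis(offline);                    // swap in a websocket-backed
                                                 // source for the online case
    }

The online version implements the same interface on top of whatever the live feed is (a websocket client, a message queue consumer, ...), so the algorithm code and its tests stay identical.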
The second part of the solution is unit and integration tests. Other users will immediately see what you did (or didn't) test. Again, packages are the distribution system of choice. This has nothing to do with Matlab/Python/R/Julia specifically: Rust has crates.io, JS has npmjs, even Java has something like Maven Central.
Regarding the funded-by-taxpayers argument: the issue I see here is that the cool ML, simulation, and data analysis stuff the CERN people do remains locked in the root ecosystem. If they used something like PyPI, I could use their stuff too. I have a lot of clustering problems, especially on time series. With a more or less proprietary system like root, I can't use any of CERN's implementations.
Regarding "researchers don't have time to learn new languages": If you look on github.com/root-project/root/issues and root-forum.cern.ch, there are suspiciously many questions regarding "how can I make use of root and Python libraries", and "X doesn't work in root, what do". Newbs have to learn root as well, and they seem to like using Python at least as an enhancement.
> C++ folks simply refuse to learn other languages
root is decades old (it goes back to the mid-nineties). And I understand that the scripts developed under it are often directly incorporated into C++ applications. CERN is a big C++ user (I have a little experience with their GEANT4 framework), and being able to do everything in one language is a big productivity boost (see for example the rise of node.js for web-related work).
It doesn't really "generate" a field. The EM field is always there, permeating the universe. An accelerating electron will _disturb_ the EM field (depositing some energy and momentum into it), and this disturbance will propagate through the field at the speed of light (naturally, since at the right energy level such a disturbance is what we call light). At high energies the disturbance will be finely localized in space and behave like a particle, which we call a photon. It's fine to refer to it that way at lower energies too, but it's slightly misleading, because at the very low energies we're talking about here ("radio") it is very spread out in space and behaves more like a wave (with a wavelength of ~meters). In the case of AC, electrons are moving "back and forth" over a short distance (a somewhat simplified but useful picture) with the same effect. Think about moving your hand up and down through water - you will create a wave.
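For a rough sense of scale (back-of-the-envelope; the specific frequencies are just examples, not from the article):

    \lambda = \frac{c}{f}:\quad
    f = 100\ \mathrm{MHz} \;\Rightarrow\; \lambda \approx 3\ \mathrm{m},\qquad
    f = 50\ \mathrm{Hz} \;\Rightarrow\; \lambda \approx 6000\ \mathrm{km}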
Also, both the E and B fields are required to transport any power using electricity. This is also true for DC. See https://en.wikipedia.org/wiki/Poynting_vector - the formula is the cross product of E and B, so if either of the two is zero, no power (or information) transmission can occur. This is true over the air as well as over a wire.
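For reference, the formula in question, in SI units (written with B rather than H):

    \mathbf{S} = \frac{1}{\mu_0}\,\mathbf{E} \times \mathbf{B}

Because it's a cross product, power flows only where both fields are non-zero, and its direction gives the direction of energy flow.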
Hmm sounds dangerously close to aether theory. I thought we had moved past this … that EM waves exist in their own right without need for a medium. Unlike sound say.
Aether refers to some historical ideas (which evolved over time), and though the word isn't in use anymore, the current model of fields has similarities (though of course it's very different in many ways). I've just been reading some of the history of this (Einstein plays a big role); it's very interesting, thank you! :) You might be interested in this discussion https://news.ycombinator.com/item?id=27942970
Wilczek points out that field theories are aether theories — as is relativity.
“Aether just means this one particular historical model!” is branding more than reality, and it hides the fact that modern theories also describe an all-pervading substrate of which particles are localized excitations and gravity is localized warping.
If you are looking for quality, I suggest you look at the IPCC reports. Each word is carefully chosen, and every claim is backed by mountains of evidence. They're written to be read and understood by non-experts, and they exist to inform decision-making that will literally determine the fate of our species. As such, they may be failing at their goal, but not for lack of effort by the authors.
I disagree. The IPCC reports may contain a lot of scientific evidence, but the way it is presented is highly political. Every word in them had to be negotiated, and that is not the same as “carefully chosen”. If you want to dive into climate change, I’d suggest taking a look at “The Warming Papers” [1], an edited compilation by David Archer and Raymond Pierrehumbert.
Agreed, with one caveat. Over time, governments have increasingly been trying to influence the reports. It all came to a head with the last one, where a group of researchers released their draft ahead of time in protest over undue influence. The reports still synthesize relevant research from past years, but there are some problems now. Research published after the first draft cannot be included, so the report is somewhat outdated by the time it's released. Beautifully crafted, though.
Yeah, this is spaced repetition, where the purpose is to memorize facts. But the asterisk method asks whether you understand a concept. The purpose of revisiting asterisks is not to aid memorization via spaced repetition but to check whether your understanding has improved after reading additional chapters. That seems like an important distinction.
The NZZ, for what it is worth, is probably the most highly regarded newspaper in Switzerland when it comes to journalistic integrity. It has a clear political bias to the right, but very high credibility.