Really exciting article from Hutter et al. I feel like I’m beating a dead horse with how much I comment about Solomonoff induction on here, but I am constantly both amazed and perplexed that we don’t have thousands of research labs going about AGI via a top-down, theory-first approach—i.e., starting with the upper bound of what is possible and working down to the nearest resource-constrained physical approximation to it.
Can we bootstrap our way to basic AGI by tinkering with neural network architectures and scaling up the hardware? Maybe. After all, the human brain sort of echoes that approach. But at some point, in order to sustain continued improvement toward optimal AGI, there will have to be a strong algorithmic component to sequence prediction that is based upon a very deep understanding of what is computable (assuming the physical Church-Turing thesis). I don’t really see a way around that, because the foundational principles of algorithmic and computational complexity ultimately determine the upper limit of our ability to predict the future, which is kind of mind-blowing to me (and even more so considering that much of the theory was developed over half a century ago).
But wait! What about the halting problem, NP hardness, the NFL theorem, Gödel’s incompleteness theorems, Blum’s speedup theorem, ..., [insert your favorite pessimistic no-go theorem]? Yeah, so what? Most of these nonstarters apply to “almost all” valid problems, which ironically happen to overlap with “almost none” of the problems we care about, because the distribution of real world problems does not coincide with the distribution of problems randomly sampled from a formal language. If that were the case, then nothing would be predictable at all because prediction-making beings could not exist in such an environment (in other words, real world problems tend to exhibit Kolmogorov-compressibility in the form of mathematical substructure that leads to heuristic solvers that are particularly effective beyond what average-case complexity would imply).
I also find this stuff fascinating, but I'm not sure Kolmogorov complexity (and therefore Solomonoff Induction) is a useful model of any form of intelligence that matters in the real world. The main issue I see is that Kolmogorov complexity is dominated by noise. The majority of bits in the smallest program that generates a given sequence will be devoted to reproducing arbitrary/accidental features of the sequence, rather than its computational structure. For example, if a Solomonoff Induction machine sees a data stream produced by someone flipping a coin, its model of that process will be... the data stream itself. To me, that's not intelligent behaviour. I would like an intelligent agent to "average out" the noise and say "this data stream is produced by sampling from a Bernoulli process with p = 0.5".
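To put the "dominated by noise" point a bit more formally (this is just the standard counting argument, not anything specific to the article): there are fewer than $2^{n-c}$ programs shorter than $n - c$ bits, so

$$\#\{x \in \{0,1\}^n : K(x) < n - c\} < 2^{n-c},$$

i.e. at most a $2^{-c}$ fraction of $n$-bit strings can be compressed by even $c$ bits. A typical fair-coin sequence therefore has $K(x) \approx n$: its shortest program is essentially a verbatim copy of the data.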
This is why I like Crutchfield's Epsilon Machine formalism[1]. He factorizes Kolmogorov Complexity into a noise part (Shannon entropy) and a "computational part", which measures the structural complexity of the smallest probabilistic finite state machine that is an optimal predictor of the data sequence, under some coarse-graining. If the size of the optimal machine keeps growing as your measurement gets more fine-grained, then there's no finite representation of the sequence as an FSM, so you "jump up one level" to a stack machine and try again.
In the coin-flipping example above, the epsilon machine will correctly factor out all of the noise in the sequence, leaving you with a simple probabilistic model of a coin-flip process.
This procedure is also computable, and even practical: you can implement the FSM-only algorithm in a couple hundred lines of code.
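As a rough illustration of what that looks like in code (a toy sketch of my own, not Crutchfield's actual CSSR algorithm; the function name and parameters are made up): group fixed-length histories whose empirical next-symbol distributions agree, and treat each group as a candidate causal state. For a fair coin every history predicts roughly 50/50, so everything should collapse to a single state.

    from collections import defaultdict, Counter
    import random

    def estimate_causal_states(seq, history_len=3, tol=0.05):
        # Empirical P(next symbol | history) for every history of fixed length.
        counts = defaultdict(Counter)
        for i in range(history_len, len(seq)):
            counts[tuple(seq[i - history_len:i])][seq[i]] += 1
        dists = {h: {s: c / sum(cnt.values()) for s, c in cnt.items()}
                 for h, cnt in counts.items()}

        # Greedily merge histories whose predictive distributions agree to
        # within tol; each resulting group stands in for one causal state.
        states = []  # list of (representative distribution, member histories)
        for h, d in dists.items():
            for rep, members in states:
                if all(abs(d.get(s, 0.0) - rep.get(s, 0.0)) < tol
                       for s in set(d) | set(rep)):
                    members.append(h)
                    break
            else:
                states.append((d, [h]))
        return states

    # Fair coin: every history predicts ~50/50, so one state should remain.
    coin = [random.randint(0, 1) for _ in range(100_000)]
    print(len(estimate_causal_states(coin)))  # expect 1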
As the guy who suggested to Marcus a lossless compression prize to replace the Turing Test, I've got to confess that all this pedantic sophistry "critiquing" algorithmic information is there for a good reason. In the immortal words of Mel Brooks: "We've got to protect our phoney baloney jobs, gentlemen!"
There is actually more at stake here than machine learning. This gets to the root of "bias" in the scientific method. Imagine what horrors, what risks, what chaos would be ours if a truly objective information criterion for causal model selection were to exist! Why, virtually every "sociologist" would be hauled to Hume's Guillotine in a Reign of Terror!
But to be clear, Marcus and I have a disagreement about the pragmatics of such an approach to dispute processing in the natural sciences. He believes, for example, that the dispute over climate change should be handled by the standard processes already in place within academia. My approach differs, based on my hard-won experience with reforming institutional incentives:
When it comes to multi-trillion-dollar scientific questions, the conflicts of interest become so intense that you really need to apply a gold standard for objectivity, and that is the single number: how big is your executable archive of the data in evidence?
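For what it's worth, here is a toy rendering of that single number in Python (standard library only; the data and the "model" are invented for illustration, and a full comparison would also charge you the bytes needed to describe the model itself):

    import lzma, struct

    data = [2 * i + (i * 37) % 5 for i in range(10_000)]  # trend plus a small deterministic wiggle

    raw = struct.pack(f"<{len(data)}i", *data)
    residuals = struct.pack(f"<{len(data)}i", *[x - 2 * i for i, x in enumerate(data)])

    print("no model:    ", len(lzma.compress(raw)))
    print("linear model:", len(lzma.compress(residuals)))  # smaller archive => better causal model, on this view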
While I understand the machine learning world looms as a rival for "unbiased" academic research, it nevertheless remains true that even in this emerging "marketplace of ideas", there is no formal definition of "bias" that disciplines discourse and thereby guides development at the institutional, let alone technical level. Everyone is weighing in with their fuzzy notions of "bias" that betray intense motivations when there has been, for over 50 years, a very clear and present mathematical definition.
When you mention the objective information criterion, I assume you mean something that takes the complexity of the model into account. A model with more parameters might fit the data better but could also be at risk of overfitting.
But isn’t there already a black-box measure of overfitting, one that focuses on how well the model generalizes to new, unseen data rather than on the model’s complexity? Something like cross-validation, hold-out validation, or the bootstrap.
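For concreteness, that black-box route looks something like this (a minimal sketch using scikit-learn on a synthetic dataset, purely illustrative):

    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import cross_val_score

    X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

    # Score each candidate by held-out performance, not by its complexity.
    for alpha in (0.01, 1.0, 100.0):
        scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)  # 5-fold cross-validation
        print(alpha, scores.mean())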
That's what all "information criteria for model selection" are about. The difference is that Algorithmic Information is the only such information criterion that has been proven (by Solomonoff) optimal under the assumptions of natural science.
I see your pathological proof of existence and raise you an asymptotic probability of one:
“The halting problem for Turing machines is perhaps the canonical undecidable set. Nevertheless, we prove that there is an algorithm deciding almost all instances of it.”
The asymptotic density of a subset of the natural numbers is defined as the limit of (number of elements of the set which are less than N)/N as N goes to infinity.
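In symbols (the same definition, just written out for a set A of natural numbers):

$$d(A) = \lim_{N \to \infty} \frac{\#\{n \in A : n < N\}}{N}.$$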
OK, by this definition, I believe that if you take the set of natural numbers and remove, say, the perfect squares, the density of what remains will be 1, or "almost all", even though the class we ignored (and similar classes) is quite large and important.
Yes, the asymptotic density of “natural numbers which are not perfect squares” is 1: there are only ⌊√N⌋ perfect squares below N, and ⌊√N⌋/N goes to 0. I.e. “almost all natural numbers are not perfect squares”.
And similarly for many other important classes.
Almost all natural numbers are not prime. Etc. etc.
Yes, that's why it would be something like a Gödel machine[0]. Ignore anything it can't prove is harmless, i.e. anything that might not halt, run too long, take too many resources, etc.
Marcus Hutter invented AIXI (Hutter 2000), the theoretically optimal AI system. It's not computable, though, and the computable version is hilariously inefficient. Yes, it uses Solomonoff induction.
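For reference, the action-selection rule, roughly as it appears in Hutter's own write-ups (from memory, so treat it as a sketch; notation may differ slightly from the article):

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q\,:\,U(q,\, a_1 \ldots a_m) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)},$$

where U is a universal Turing machine, q ranges over its programs, ℓ(q) is the program length, a/o/r are actions, observations, and rewards, and m is the horizon. The inner sum over programs is the Solomonoff-style prior, which is exactly why none of this is computable.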
I actually don't know. They say in the paper (in the UTM, universal Turing machine, section) that they can't figure out a better way to approximate Solomonoff induction, so they just guess some reasonable value (based on knowledge of the program that generates the sequence).
Interesting that they combine assumptions from highly idealized abstractions (universal Turing machines, Solomonoff induction) with realistic neural network architectures.
They have been doing this forever. Here is a paper by some of the same authors: A Monte Carlo AIXI Approximation (2009). https://arxiv.org/abs/0909.0801