The post is long and complicated and I haven't read most of it, so whether it's actually any good I shan't try to decide. But the above seems like a very weird argument.
Sure, the code is doing what it's doing. But trying to understand it at that level of abstraction seems ... not at all promising.
Consider a question about psychology. Say: "What are people doing when they decide what to buy in a shop?".
If someone writes an article about this, drawing on some (necessarily simplified) model of human thinking and decision-making, and some experimental evidence about how people's purchasing decisions change in response to changes in price, different lighting conditions, mood, etc., ... would you say "You can just apply the laws of physics and see what the people are doing. They're not doing something more or less than that."?
I mean, it would be true. People, so far as we know, do in fact obey the laws of physics. You could, in principle, predict what someone will buy in a given situation by modelling their body and surroundings at the level of atoms or thereabouts (quantum physics is a thing, of course, but it seems likely that a basically-classical model could be good enough for this purpose). When we make decisions, we are obeying the laws of physics and not doing some other thing.
But this answer is completely useless for actually understanding what we do. If you're wondering "what would happen if the price were ten cents higher?" you've got no way to answer it other than running the whole simulation again. Maybe running thousands of versions of it since other factors could affect the results. If you're wondering "does the lighting make a difference, and what level of lighting in the shop will lead to people spending least or most?" then you've got no way to answer it other than running simulations with many different lighting conditions.
Whereas if you have a higher-level, less precise model that says things like "people mostly prefer to spend less" and "people try to predict quality on the basis of price, so sometimes they will spend more if it seems like they're getting something better that way" and "people like to feel that they're getting a bargain" and so on, you may be able to make predictions without running an impossibly detailed person-simulation zillions of times. You may be able to give general advice to someone with a spending problem who'd like to spend more wisely, or to a shopkeeper who wants to encourage their customers to spend more.
Similarly with language models and similar systems. Sure, you can find out what it does in some very specific situation by just running the code. But what if you have some broader question than that? Then simply knowing what the code does may not help you at all, because what the code does is gazillions of copies of "multiply these numbers together and add them".
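To make that concrete, here's a rough sketch in plain NumPy of roughly what one attention head computes (made-up shapes, not any particular model's actual code). It really is just multiplies and adds, and reading it tells you nothing about what any particular set of trained weights has learned to do.

```python
import numpy as np

# One attention head, stripped to its arithmetic: nothing but matrix
# multiplies, adds, and a normalization. Shapes are made up for
# illustration (4 tokens, model width 8).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))                   # token representations
W_q, W_k, W_v = (rng.standard_normal((8, 8)) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v               # multiply numbers together...
scores = q @ k.T / np.sqrt(8)                     # ...and again
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)    # softmax: normalize each row
out = weights @ v                                 # ...and add them up

print(out.shape)  # (4, 8) - and none of this says *why* a trained model behaves as it does
```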
Again, I make no claim about whether the particular thing linked here offers much real insight. But it makes zero sense, so far as I can see, to dismiss it on the grounds that all you need to do is read the code.
You’re spot on; it’s like saying you can understand the game of chess by simply reading the rules. In a certain very superficial sense, yes. But the universe isn’t so simple. It’s for the same reason that even a perfect understanding of what goes on at the level of subatomic particles isn’t thought to be enough to say we ‘understand the universe’. A hell of a lot can happen between the setting out of some basic rules and the end result, which lives at a much higher level.
My entire point is that implementation isn’t sufficient for understanding. AlphaZero is the perfect example of that: you can create an amazing chess-playing machine and (potentially) learn nothing at all about how to play chess.
…so what’s your point? I’m not getting it from those two words.
Understanding how the machine plays or how you should play? They aren't the same thing. And that is the point - trying to analogize to some explicit, concrete function you can describe is backwards. These models are gigantic (even the 'small' ones); they minimize a loss function by searching a space with many thousands of dimensions. It is the very opposite of something that fits in a human brain in any explicit fashion.
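To put a number on "multi-thousand-dimensional": even a toy back-of-the-envelope network (made-up architecture here, nothing like a real model) is already a search over roughly a hundred thousand coordinates.

```python
import numpy as np

# Even a tiny toy network is already an optimization over ~100k dimensions:
# every weight and bias is one coordinate of the search space.
layer_sizes = [128, 256, 256, 10]                 # made-up toy architecture
n_params = sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))
print(n_params)                                   # ~101k for this toy; real LLMs are billions

# Training is just nudging one point around in that space:
theta = np.zeros(n_params)                        # the whole model, as one vector
# for each batch: theta -= learning_rate * gradient_of_loss(theta, batch)
```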
So is what happens in an actual literal human brain.
And yet, we spend quite a lot of our time thinking about what human brains do, and sometimes it's pretty useful.
For a lot of this, we treat the actual brain as a black box and don't particularly care about how it does what it does, but knowing something about the internal workings at various levels of abstraction is useful too.
Similarly, if for whatever reason you are interested in, or spend some of your time interacting with, transformer-based language models, then you might want some intuition for what they do and how.
You'll never fit the whole thing in your brain. That's why you want simplified abstracted versions of it. Which, AIUI, is one thing that the OP is trying to do. (As I said before, I don't know how well it does it; what I'm objecting to is the idea that trying to do this is a waste of time because the only thing there is to know is that the model does what the code says it does.)
Sure, good abstractions are good. But bad abstractions are worse than none. Think of all the nonsense abstractions about the weather before people understood and could simulate the underlying process. No one in modern weather forecasting suggests there is a way to understand that process at some high level of abstraction. Understand the low level, run the calcs.
Trying to build systems top-down using principles humans can fit in their head has arguably been a dead end. But this doesn't mean that we cannot try to understand parts of current AI systems at a higher level of abstraction, right? They may not have been designed top-down with human-understandable principles, but that doesn't mean that trained, human-understandable principles couldn't have emerged organically from the training process.
Evolution optimized the human brain to do things over an unbelievably long period of time. Human brains were not designed top-down with human-understandable principles. But neuroscientists, cognitive scientists, and psychologists have arguably had success with understanding the brain partially at a higher level of abstraction than just neurons, or just saying "evolution optimized these clumps of matter for spreading genes; there's nothing more to say". What do you think is the relevant difference between the human brain and current machine learning models that makes the latter just utterly incomprehensible at any higher level of abstraction, but the former worth pursuing by means of different scientific fields?
I don't know neuroscience at all, so I don't know if that's a good analogy. I'll make a guess, though: consider a standard RAG application. That's a system which uses at least a couple of models. A person might reasonably say "the embeddings in the db are where the system stores memories. The LLM acts as the part of the brain that reasons over whatever is in working memory plus its sort-of-implicit knowledge." I'd argue that's reasonable. But systems and models are different things.
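To make that guess concrete, here's a toy sketch of the kind of RAG setup I mean (embed and generate here are crude stand-ins I'm making up for illustration, not any real library's API):

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Crude stand-in for a real embedding model: hash words into a vector."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v / (np.linalg.norm(v) or 1.0)

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call; here it just echoes its input."""
    return f"[the LLM would answer given]\n{prompt}"

# The "memories": documents stored as vectors in a toy in-memory database.
documents = ["Refunds are accepted within 30 days.",
             "Standard shipping takes 3-5 business days.",
             "The warranty covers manufacturing defects for one year."]
db = [(doc, embed(doc)) for doc in documents]

def answer(question: str) -> str:
    q = embed(question)
    # Retrieval: pull the most similar stored "memory" (cosine similarity).
    best_doc, _ = max(db, key=lambda pair: float(q @ pair[1]))
    # The LLM then "reasons over" the retrieved context plus its implicit knowledge.
    return generate(f"Context: {best_doc}\n\nQuestion: {question}")

print(answer("How long does shipping take?"))
```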
People use many abstractions in AI/ML. Just look at all the functionality you get in PyTorch as an example. But they are abstractions of pieces of a model, or pieces of the training process etc. They aren't abstractions of the function the model is trying to learn.
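For example, here's a minimal, made-up PyTorch training step. Every abstraction in it names a piece of the model or a piece of the training loop; none of it describes the function the trained weights end up computing.

```python
import torch
import torch.nn as nn

# PyTorch's abstractions name pieces of the model and of the training loop...
layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(8, 16, 64)        # (batch, sequence, features): made-up data
y = torch.randn(8, 16, 64)

# ...but nothing in this code describes what the model learns to do.
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
```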
Right, I've used PyTorch before. I'm just trying to understand why the mechanics of self-attention layers should be the highest level of abstraction at which the question "how does a transformer work?" can be meaningfully answered, with any higher level of abstraction being nonsense. More specifically, why we should have a ban on any higher level of abstraction in this scenario when we can answer the question "how does the human mind work?" at not just the atom level, but also the neuroscientific level or psychological level. Presumably you could say the same thing about that question: the human mind is a bunch of atoms obeying the laws of physics. That's what it's doing. It's not something else.
I understand you're emphasizing the point that the connectionist paradigm has had a lot more empirical success than the computationalist paradigm: letting AI systems learn organically, bottom-up, is more effective than trying to impose human-mind-like principles top-down when we design them. But I don't understand why this means understanding bottom-up systems at higher levels of abstraction is necessarily impossible, when we have a clear example of a bottom-up system that we've had some success in understanding at a high level of abstraction, viz. the human mind.
It would be great if the abstractions were good, but they seem to be bad; it seems that they must be bad, given the dimensionality of the space, and humans latch onto simple explanations even when they are bad.
Think about MoE models. Each expert learns to be good at completing certain types of inputs. It sounds like a great explanation for how they work. Except it doesn't seem to actually work that way: the Mixtral paper showed that which experts got activated seemed to follow basically no pattern. Maybe if they trained it differently it would? Who knows. "Expert" certainly isn't a good name regardless.
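The actual mechanism is just gating arithmetic, roughly like this toy top-2 sketch (made-up sizes, not Mixtral's actual code); the "each expert specializes in a topic" story is an interpretation laid on top of it.

```python
import numpy as np

# Toy mixture-of-experts routing with top-2 gating (made-up sizes; real
# models do this per token, per layer).
rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2
x = rng.standard_normal(d)                          # one token's hidden state
W_gate = rng.standard_normal((n_experts, d))        # the router
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]

logits = W_gate @ x                                 # router score per expert
chosen = np.argsort(logits)[-top_k:]                # keep the top-2 experts
gate = np.exp(logits[chosen])
gate /= gate.sum()                                  # softmax over the chosen two

# Output: weighted sum of only the chosen experts' computations. Nothing
# here forces "expert 3 = code, expert 5 = French" or any other
# human-friendly specialization.
out = sum(g * (experts[i] @ x) for g, i in zip(gate, chosen))
print(out.shape)                                    # (16,)
```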
Many fields/things can be understood at higher and higher levels of abstraction. Computer science is full of good high level abstractions. Humans love it. It doesn't work everywhere.
Right, of course we should validate explanations based on empirical data. We rejected the idea that there was a particular neuron that activated only when you saw your grandmother (the "grandmother neuron") after experimentation. But just because past explanations have been bad doesn't mean that all future explanations must also be bad. Shouldn't we evaluate explanations on a case-by-case basis instead of dismissing them as impossible? Aren't we better off having evaluated the intuitive explanation for mixtures of experts instead of dismissing it a priori? There's a whole field - mechanistic interpretability - where researchers are working on this kind of thing. Do you think that they simply haven't realized that the models they're working on interpreting are operating in a high-dimensional space?
Mechanistic interpretability studies a bunch of things, though. The Mixtral paper where they show the routing activations is mechanistic interpretability, for instance, and that sort of feature-visualization stuff is good. But I don't know what % of the field is spending their time on the kind of work that tries to interpret models via higher-level, human-explainable approximations of what the code is doing. I'm certainly not the only one who thinks it's a waste of time; I don't believe anything I've said in this thread is original in any way.
I... don't know if the people involved in that specific stuff have really grokked that they are working in a high-dimensional space? A lot of otherwise smart people work in macroeconomics, where for decades they haven't really made any progress because it's so complex. It seems stupid to suggest a whole field of smart people don't realize what they are up against, but sheesh, it kinda seems that way, doesn't it? Maybe I'll be eating my words in 10 years.
They certainly understand they're working in a high dimensional space. No question. What they deny is that this necessarily means the goal of interpretability is a futile one.
But the main thrust of what I'm saying is that we shouldn't be dismissing explanations a priori - answers to "how does a transformer work?" that go beyond descriptions of self-attention aren't necessarily nonsensical. You can think it's a waste of time (...frankly, I kind of think it's a waste of time too...), but just like any other field, it's not really fair to close our eyes and ears and dismiss proposals out of hand. I suppose
> Maybe I'll be eating my words in 10 years.
indicates you understand this though.