If your point is that there's a very very wide class of problems whose answer is a sequence (of actions, propositions, etc.) -- then you're quite correct.
But that isn't what transformers model. A transformer is a function of historical data which returns a function of inputs by inlining that historical data. You could see it as a higher-order function: transformer : Data -> (Prompt -> Answer), so that promptable = transformer(historical_data) : Prompt -> Answer.
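To make the shape of that concrete, here's a minimal Python sketch -- a toy bigram table standing in for the weights, nothing here is meant as a real transformer, just the higher-order structure:

```python
from collections import Counter, defaultdict
from typing import Callable

Prompt = str
Answer = str
Data = list[str]

def transformer(historical_data: Data) -> Callable[[Prompt], Answer]:
    # "Training": inline the historical data into a frozen frequency
    # table -- this table plays the role of the weights.
    weights: defaultdict[str, Counter] = defaultdict(Counter)
    for text in historical_data:
        tokens = text.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            weights[prev][nxt] += 1

    def promptable(prompt: Prompt) -> Answer:
        # The answer is read off the inlined statistics; the original
        # data is never consulted again.
        last = prompt.split()[-1]
        follow = weights.get(last)
        return follow.most_common(1)[0][0] if follow else ""

    return promptable

# promptable : Prompt -> Answer, obtained by applying transformer to Data
promptable = transformer(["the cafe is closed", "the cafe is open"])
print(promptable("is the cafe"))
```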
It is true that Prompt and Answer both lie within Sequence; but they do not cover Sequence (i.e., all possible sequences), nor is the transformer's strategy for computing an Answer from a Prompt even capable of searching the full space (Prompt, Answer) in a relevant way.
In particular, its search strategy (i.e., the body of `promptable`) is just a stochastic algorithm which takes in a bytecode (the weights) and evaluates it by biased random jumping. The weights are an inlined subspace of (Prompt, Answer), obtained by sampling that space according to the historical frequencies of the prior data.
This generates Answers which are sequenced according to "frequency-guided heuristic searching" (I guess a kind of "stochastic A* with inlined historical data"). Now this precludes the imposition of any deductive constraints on the answers: e.g., (A, notA) should never be sequenced, yet it can be generated by at least one search path in this space, given a historical dataset in which both A and notA appear.
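A toy illustration of what I mean by frequency-guided sampling imposing no deductive constraint (made-up counts; `sample_next` is illustrative, not a real decoder):

```python
import random

# Made-up "inlined" statistics: both A ("open") and notA ("closed")
# follow the same context in the historical data.
next_token_freq = {"the cafe is": {"open": 7, "closed": 5}}

def sample_next(context: str, rng: random.Random) -> str:
    freqs = next_token_freq[context]
    tokens, weights = zip(*freqs.items())
    # Frequency-guided stochastic choice: nothing here ever removes a
    # continuation on deductive grounds.
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)
print({sample_next("the cafe is", rng) for _ in range(20)})
# both 'open' and 'closed' are reachable search paths from one context
```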
Now, things get worse from here. What a proper simulation of counterfactuals requires is partitioning the space of relevant Sequences into coherent subsets (A, B, C, ...); (A', B', C', ...) but NOT (A, notA, A'), etc. This is like "super deduction", since each partition needs to be "deductively valid", and there need to be many such partitions.
And so on. As you go up the "hierarchy of constraints" of this kind, you recursively require ever more rigid logical consistency, but this is precluded even at the outset. E.g., consider that a "Goal" is going to require classes of classes of such constrained subsets, since we need to evaluate counterfactuals to determine which class of actions realises any given goal, and any given action implies many consequences.
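For concreteness, here's roughly what the very first rung of that hierarchy looks like as a toy check over signed literals (names purely illustrative):

```python
# Signed literals stand in for propositions; "~A" is notA. A candidate
# sequence is admissible here iff it never contains a literal together
# with its negation.

def coherent(sequence: set[str]) -> bool:
    return not any(lit.startswith("~") and lit[1:] in sequence
                   for lit in sequence)

candidates = [
    {"A", "B", "C"},      # one coherent world
    {"A'", "B'", "C'"},   # a distinct, also-coherent counterfactual world
    {"A", "~A", "A'"},    # incoherent: contains both A and notA
]

print([seq for seq in candidates if coherent(seq)])
# only the first two survive; this filtering is the *first* rung of the
# hierarchy, before goals or consequences enter the picture
```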
Just try to solve the problem, "buying a coffee at 1am" using your imagination. As you do so, notice how incredibly deterministic each simulation is, and what kind of searching across possibilities is implied by your process of imagining (notice, even minimally, you cannot imagine A & notA).
The stochastic search algorithms which comprise modern AI do not model the space of, say, Actions in this way. This is only the first hurdle.
> This generates Answers which are sequenced according to "frequency-guided heuristic searching" (I guess a kind of "stochastic A* with inlined historical data")
This sounds like far too simplistic an understanding. Transformers aren't just heuristically pulling token cards out of a randomly shuffled deck; they sit on a knowledge graph of embeddings that creates a consistent structure representing the underlying truths and relationships.
The unreliability comes from the fact that, within the response tokens, "the correct thing" may be replaced by "a thing like that" without completely breaking these structures and relationships. For example: in the nightmare scenario of STRAWBERRY, the letters themselves had very little distinct signal in relation to the concept of strawberries, so they got miscounted (I assume this has been fixed in every pro model). BUT I don't remember any 2023 models such as claude-3-haiku making fatal logical errors such as saying "P" and "!P" while assuming ceteris paribus, unless you jumped through hoops trying to confuse it and find weaknesses in the embeddings.
You've just given me the heuristic, and told me the graph -- you haven't said A* is a bad model, you've said it's exactly the correct one.
However, transformers do not sit on a "knowledge graph", since the space is not composed of discrete propositions set in discrete relationships. If it were, then P(NextState | PrevState) = 0 would obtain for many pairs of states -- this would destroy the transformer's ability to make progress.
So rather than 'deviation from the truth' being an accidental symptom, it is essential to the model's operation: there can be no distinction made between true and false propositions if the model is to operate at all.
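A toy contrast, with made-up numbers, of why hard zeros are the signature of a discrete graph and why softmax over embedding similarities never produces them:

```python
import math

states = ["P", "Q", "R"]

# Discrete knowledge graph: non-edges get probability exactly zero.
graph_edges = {("P", "Q"), ("Q", "R")}

def graph_prob(prev: str, nxt: str) -> float:
    return 1.0 if (prev, nxt) in graph_edges else 0.0

# Continuous embeddings + softmax: every continuation gets some mass.
embeddings = {"P": [1.0, 0.0], "Q": [0.8, 0.6], "R": [0.0, 1.0]}

def softmax_prob(prev: str, nxt: str) -> float:
    scores = {s: sum(a * b for a, b in zip(embeddings[prev], embeddings[s]))
              for s in states}
    z = sum(math.exp(v) for v in scores.values())
    return math.exp(scores[nxt]) / z

print(graph_prob("P", "R"))    # 0.0  -- a hard block on that transition
print(softmax_prob("P", "R"))  # small but strictly positive
```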
> making fatal logical errors such as saying "P" and "!P"
Since it doesn't employ propositions directly, how you interpret its output in propositional terms will determine whether you think it's saying P & !P. This "interpreting-away" effect is common in religious readings of texts, where the text is divorced from its meaning and a new one substituted to achieve apparent coherence.
Nevertheless, if you're asking (Question, Answer)-style prompts where there is a canonical answer to a common question, then you're not really asking it to "search very far away" from its inlined historical data (the ersatz knowledge graph that it does not possess).
These errors become more common when the questions require posing several counterfactual scenarios derived from the prompt, or otherwise have non-canonical answers which require integrating disparate propositions given in the prompt.
The prompt's propositions each compete to drag the search in various directions, and there is no constraint on where it can be dragged.
I am not going to engage with your A* proposition. I believe it to be irrelevant.
> However, transformers do not sit on a "knowledge graph", since the space is not composed of discrete propositions set in discrete relationships.
This is the main point of contention. By all means, embeddings are a graph, in the sense that you can use a graph (but not a tree) to represent their data structure. Sure, they are essentially points in space, but a graph emerges as the architecture starts selecting tokens for use according to the learned parameters during inference. It will always be the same graph for the same set of tokens for a given data set which provides "ground truth". I know it sounds metaphorical, but bear with me.
The above process doesn't result in discrete propositions like we have in Prolog, but the point is, it is "relatively" meaningful, and you seed a traversal by bringing tokens to the attention grid. What I mean by relatively meaningful is that inverse relationships are far enough apart that they won't usually be confused, so there is less chance of meaningless gibberish emerging, which is what we observe.
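Here's a toy sketch of the kind of graph I mean emerging from proximity in the embedding space (vectors invented for the example, obviously not real learned embeddings):

```python
import math

emb = {
    "coffee":   [0.9, 0.1, 0.0],
    "espresso": [0.85, 0.2, 0.0],
    "cafe":     [0.7, 0.3, 0.1],
    "closed":   [0.0, 0.2, 0.9],
}

def cos(a, b):
    return sum(x * y for x, y in zip(a, b)) / (math.hypot(*a) * math.hypot(*b))

def knn_graph(embeddings, k=1):
    # Link each token to its k nearest neighbours by cosine similarity:
    # the "graph" is read off proximity, not stored as discrete relations.
    graph = {}
    for tok, vec in embeddings.items():
        others = sorted(((cos(vec, v), t) for t, v in embeddings.items() if t != tok),
                        reverse=True)
        graph[tok] = [t for _, t in others[:k]]
    return graph

print(knn_graph(emb))  # "coffee" links to "espresso", not to "closed"
```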
If I replaced "transformer" in your comment with "human", what changes? That's my point.
Humans are a "function of historical data" (nurture). Meatbag I/O doesn't span all sequences. A person's simulations are often painfully incoherent, etc. So what? These attempts at elevating humans seems like anthropocentric masturbation. We ain't that special!