Transformer as a general purpose computer (jvoderho.com)
148 points by tosh on April 10, 2024 | 43 comments


The title is a little misleading. The post makes an analogy between a Turing machine and a transformer and expands on that analogy a bit. The substance of the article is an explanation of the design of a standard transformer.

It does not use transformers as a computer in any novel way. It just makes the analogy.


For those unaware of Alex Graves' work, check out the NTM paper: https://arxiv.org/abs/1410.5401

This actually does what this blog post pretends to


Except it proves nothing about TC; it just refers to a paper that demonstrated RNNs were TC.

The authors' decision to name their creation a "Neural Turing Machine" doesn't make it computationally equivalent.

The fact that it explicitly states:

> ...analogous to a Turing Machine or Von Neumann architecture but is differentiable end-to-end, allowing it to be efficiently trained with gradient descent.

Actually demonstrates it is not equivalent.

Note that I am not saying their approach is wrong, but that their overloading of terms suggests more than they claim.

While 'general-purpose computer' is ambiguous, a Turing machine is not as opaque.

LLMs can be TC with infinite space for input and infinite width.

For finite input, size, and depth, LLMs aren't even equivalent to primitive recursive functions IIRC.

With constant-depth threshold circuits being a recent claim as the limit of LLMs.


Original creator of the blog here. Yes, you are right, the title of the blog is a bit misleading. I originally wrote this for a course at my university and thought it was interesting, so I integrated it into my website. I didn't expect to get that much traffic on the blog and didn't intend to do any clickbaiting. I just wanted to expand on what Andrej Karpathy mentioned in one of his tweets: https://twitter.com/karpathy/status/1582807367988654081. I think I also linked it in the blog post.

I think this view on transformers is a very powerful one and explains best why the transformer is built the way it is. It also allows us to reason about certain improvements to the transformer, for example weight-tying some layers to make it easier to map to algorithms that have a lot of repeated operations.

The goal of the blog was not to use the transformer as a general purpose differentiable computer but to convince you that the transformer already IS a general purpose differentiable computer. Maybe I will write another blog post where I train the transformer on simple algorithmic problems like sorting or copying data, just like the differentiable Turing machine, or even show how you can hard-code the transformer weights to solve them. What do you think would be a better title for the blog?
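(For the curious, here is a minimal sketch of what generating data for such a copy task might look like, in Python; the vocabulary size, special tokens, and sequence length are arbitrary illustrative choices, not anything from the post.)

  import random

  # Toy data generator for a sequence-copy task (hypothetical setup).
  # Input:  BOS, x1, ..., xn, SEP    Target: x1, ..., xn, EOS
  VOCAB_SIZE = 16             # tokens 0..15 are data symbols
  BOS, SEP, EOS = 16, 17, 18  # special tokens

  def make_copy_example(seq_len):
      data = [random.randrange(VOCAB_SIZE) for _ in range(seq_len)]
      return [BOS] + data + [SEP], data + [EOS]

  for _ in range(3):
      src, tgt = make_copy_example(seq_len=5)
      print(src, "->", tgt)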


That makes sense. It definitely read like a university project to me.

One idea is "Transformers, analogous to general purpose computers", although that doesn't mention the meat of the article. "Transformers from scratch, a Turing machine analogy" is another idea.

I'm not sure they're good titles but they don't imply that it's about using a transformer as a computer.

With the goal you state, I understand better how you came to the title.


Interesting, I thought it was proven that Transformers cannot do recursion [0] and aren't actually equivalent to a Turing machine, due to the lack of expandable memory, and are rather finite state automata.

[0]: https://youtu.be/rie-9AEhYdY?si=5p0QCxAFovGUHFRb&t=1744


I'm pretty sure any physically realizable computer is a Finite State Machine (FSM).

The Turing Machine assumptions of infinite memory and unbounded time lead to a lot of poor intuitions about them, in my experience.

Those bad intuitions would also apply to FSMs, though. Or to the human brain, for that matter.

For instance, a well-trained human brain may be Turing complete if it can access some external storage (like pen and paper). This leads to confused concepts such as the Chinese Room argument.

Even though a human brain can contain a Turing Machine, it doesn't mean that understanding a Turing Machine gives much relevant understanding of how the brain normally operates. The native compute model of the brain is completely different.


LLMs are pessimistically constrained to being equivalent to TC0, or more specifically are uniform constant-depth threshold circuits.

Circuits and TMs are very different, but while LLMs can use threshold circuits, they do not have access to finite-size-input parity circuits, nor can they do operations like XOR in a single layer.

It is the cost of high parallelism.
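As a concrete illustration of the single-layer XOR point above (the classic single-threshold-gate argument, not anything specific to LLMs), a brute-force search over one threshold gate never matches XOR, while two layers of the same gates do:

  import itertools

  def gate(w1, w2, b, x1, x2):
      # A single threshold gate: fires iff the weighted sum is >= 0.
      return int(w1 * x1 + w2 * x2 + b >= 0)

  inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
  xor = [0, 1, 1, 0]

  # Coarse brute-force search: no single gate reproduces XOR.
  grid = [v / 2 for v in range(-8, 9)]
  found = any([gate(w1, w2, b, *x) for x in inputs] == xor
              for w1, w2, b in itertools.product(grid, repeat=3))
  print("single gate computes XOR:", found)  # False

  # Two layers: XOR = AND(OR(x1, x2), NAND(x1, x2)).
  def xor2(x1, x2):
      return gate(1, 1, -1.5, gate(1, 1, -0.5, x1, x2), gate(-1, -1, 1.5, x1, x2))

  print([xor2(*x) for x in inputs])  # [0, 1, 1, 0]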

For those who are trying to figure out real-world applications, it is worth the effort to learn circuit complexity, as it will be far more realistic than the TM model.

https://complexityzoo.net/Complexity_Zoo:T#tc0


People always make these points about things not technically being Turing machines because they are finite, because they have finite memory. Well, obviously! Your computer has finite memory too. It doesn't matter in practice, and the conceptual idea that this thing can do arbitrary computations is more important.


The argument isn't finite memory but rather the lack of expandable memory; that is, it cannot learn to read/write to a hypothetical infinite tape. (Although I'm maybe not the right person to word it; I recommend the previously linked video.)


> it cannot learn to read/write to a hypothetical infinite tape

Modern LLMs can perform external function calling. How is that not the same as having access to a potentially infinite memory?

Conversely, your brain does not contain infinite storage. So it is clearly not Turing complete in the strictest sense. Does that tell us anything useful about the level of intelligence in a brain?


> Conversely, your brain does not contain infinite storage.

But you can address it and use it; you can think of a system which can make use of infinite memory and then execute that system.

LLMs cannot use something like `set(0x8080, "ABC")` and later retrieve that with `get(0x8080)` without keeping the initial command in their context; therefore their memory is not expandable. Likewise, LLMs can call an external function, but the input of the LLM is a fixed context on which it's trained, and it has a fixed amount of time to use that context, so it's not expandable.
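To make the set/get point concrete, here is a minimal sketch (in Python, with hypothetical tool names) of what an external key-value memory exposed to a model as callable tools could look like; the store itself can grow without bound, but deciding when to call get still has to happen inside the model's bounded context:

  # Hypothetical external memory "tools" an LLM could be allowed to call.
  memory = {}

  def set_value(address, value):
      # Write a value at an address in external storage.
      memory[address] = value

  def get_value(address):
      # Read back whatever was stored at that address, if anything.
      return memory.get(address)

  set_value(0x8080, "ABC")   # the store is unbounded...
  print(get_value(0x8080))   # "ABC"
  # ...but knowing that 0x8080 is worth reading later must still fit
  # in the model's fixed-size context window.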

Now this is not necessarily a limitation of the architecture, but rather of how we train NNs. It's also not a necessity for doing something useful, but the efficient/optimal solutions for many problems are often Turing-complete and are thus not findable or even searchable by NNs.

The video goes into more depth on everything mentioned, and on the distinction between Turing machines and NN limitations.


A finite state machine can control the head of a Turing machine (indeed, that is what controls the head of a Turing machine).

A transformer could be trained to implement a particular finite state machine (using a fixed length context window). Of course, so could a fully connected 1-hidden-layer feed-forward neural network. Though, a lookup table would probably be more practical than either of those.
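For instance, a small finite state machine as a plain transition lookup table might look like this (the example machine, which accepts binary strings with an even number of 1s, is made up for illustration):

  # A finite state machine as a transition lookup table.
  # Toy example: accept binary strings with an even number of 1s.
  TRANSITIONS = {
      ("even", "0"): "even", ("even", "1"): "odd",
      ("odd",  "0"): "odd",  ("odd",  "1"): "even",
  }

  def run_fsm(s, state="even"):
      for symbol in s:
          state = TRANSITIONS[(state, symbol)]
      return state == "even"

  print(run_fsm("1101"))  # False: three 1s
  print(run_fsm("1001"))  # True: two 1s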

I'm not saying much here, just that "technically, a transformer equipped with interaction with another system, can be Turing complete (though not necessarily in a way that matters)." .

Giving a finite state machine the ability to read and write finite amounts of information to memory addresses (where each address can only store a finite amount of info) only goes beyond being a (perhaps much larger) finite state machine if there are infinitely many memory locations that can be used, so these locations can't be limited to ones whose addresses can be specified entirely in terms of a state of the finite state machine.

the following is a tangent:

Sometimes I wonder about, "If we wanted to design a computer that used silicon chips and other such electrical components, but was modular in a way that would make it technically Turing complete as long as we connected (and provided adequate power to) additional modules whenever it tried to use the next one it is missing, what would be a good way to design that?".

A single tape of symbols, or even several tapes, where it can only move one position at a time (per tape), seems probably inefficient (though perhaps not in any asymptotic sense). But, if one can only jump to positions labeled by addresses of a bounded length, that's no good. Therefore, I think that maybe a good solution would be to use relative addresses, where the offsets have a maximum size of 2^32 or 2^16 or something. I think that, in order to prevent computation at later memory addresses from being slower, having to send the information back and forth between the module where it is stored and the information describing the FSM/program, it might be good to have either all the modules, or some fraction of them, contain copies of the hardware for running the computation steps, not just forwarding and implementing messages to read/write. Though, in that case, it does seem a bit of a waste to not use parallelism. Alternatively, I suppose having a literally-moving head, which stores the FSM, and moves between the different memory modules, might make sense?


For me, the nature of computation is just not about extendability of memory. It's a technicality.


Isn't recursion solved by just feeding the output of the LLM (with some additional instructions) back into the input? That's how a lot of complex problems are already tackled with LLMs.
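A rough sketch of that pattern (with a placeholder call_llm standing in for whatever completion API is actually used):

  def call_llm(prompt):
      # Placeholder: wire up a real LLM client here.
      raise NotImplementedError

  def iterate(task, max_steps=5):
      # Feed the model's own output back in as the next input.
      state = task
      for _ in range(max_steps):
          state = call_llm("Current state:\n" + state +
                           "\n\nContinue or improve this. Reply DONE when finished.")
          if state.strip().endswith("DONE"):
              break
      return state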


Perhaps similarly, neural activation patterns in biological neural networks are cyclic: there are cycles in the graph (which has oscillating voltage due to the heart being an electrical generator in addition to a pump).


Could Transformers/LLMs take over custom software implementations given enough training? Why have so many custom ERP software suites or anything else related to the backend if a transformer can do the work?


> Why have so many custom ERP software suites or anything else related to the backend if a transformer can do the work?

Because they can't. Transformers are great at probabilistic tasks that traditional computing can't do well, like natural language processing. They're very suboptimal for the tasks that traditional computing is already good at.

The first and most obvious reason is that traditional computing is deterministic and predictable in its behavior. This is great for business systems because errors can be tracked down to specific lines of code that can be demonstrably fixed and regression checked. A transformer doesn't have this property—when something fails you can try tweaking parameters like temperature or fine tuning it longer, but it's hard to prove that the specific bug won't show up again.

The second reason is cost—even saying you can make a reliable transformer-based ERP system, in the long run it won't cost less than the traditional equivalent. The initial training cost can be enormous and you still need highly paid staff to organize that work. Once you have the working system, your business will still need to make changes to the rules over time, so you'll still need to keep some of those staff on hand to make those changes, but you also have to pay for the GPU time to run the application, which will cost much more than the equivalent traditional ERP.

Finally, even if all of that sounds fine, at some point you need to store the data and host a web server of some kind, which means you're going to need to interface with deterministic computation. I guess you could go the tool use route and let the LLM generate its database queries on the fly, but that makes all the reliability problems in the first paragraph much much worse. And if you're going to hire web devs anyway, why not just have them do everything?


Just wondering: if we were to combine the stochastic Transformer approach with its deterministic NLP cousin, namely Typed Feature Structure Grammar (TFSG) as implemented in the CUE language (non-Turing), could the combination become the generic computing backend [1],[2]?

Before transformers, NLP people used deterministic approaches like TFSG, as employed by the CUE language [2],[3].

[1] The Logic of CUE: https://cuelang.org/docs/concept/the-logic-of-cue/

[2] CUE foundations: https://cuetorials.com/overview/foundations/

[3] Feature structure: https://en.wikipedia.org/wiki/Feature_structure


Very good (and concise) explanation. Thanks!


Unless you don't value correctness and safety, it's probably not a great idea to use a probabilistic algorithm as the backbone for your software


Or unless you have additional feedback mechanisms for correctness and safety. Transformers are components -- subsystems at best, not complete systems.

And at some point, "good enough" always wins over "perfect" in real-world systems.

Always. No exceptions.


I think the problem is that small errors compound, even if each step in a process has a low chance of going wrong.


DeepMind found transformers get beaten out by Neural Turing Machines and even LSTMs at learning to reproduce outputs toward the general-computer end of the Chomsky formal language hierarchy:

https://arxiv.org/abs/2207.02098


As others have said, the title does not match the contents of the article, and the way this has been explained looks like a very opaque and highly abstract Rube Goldberg machine with limited use cases.

What does this solve? Even by going with the flawed title, this mostly is a solution in search of a problem.


Original creator of the blog here. I originally wrote this for a course at my university and thought it was interesting, so I integrated it into my website. So I didn't want to "solve" anything; I just wanted to expand on what Andrej Karpathy mentioned in one of his tweets: https://twitter.com/karpathy/status/1582807367988654081. I think this view on transformers is a very powerful one and explains best why the transformer is built the way it is. It also allows us to reason about certain improvements to the transformer. This was mostly an educational blog.


It solves language understanding in a practical sense - a task heretofore considered impossible


This has been already solved even without LLMs for years: Google Translate, DeepL and Firefox Translate have done this use-case already, much more efficiently than LLMs.

So what does this particular use case of a transformer as a general-purpose computer solve that hasn't already been solved before? Why do we need this?


> Google Translate, DeepL and Firefox Translate have done this use-case already, much more efficiently than LLMs.

They may have been more efficient, but the transformer-based Google Translate we have today is way better at its job than its predecessors, and the old Google Translate wasn't capable of other NLU tasks that can now all be done by transformers (and again, all better than the previous specialized tools).

I agree that transformer-as-Turing-machine isn't really solving any problems, but transformers have been nothing short of revolutionary for tasks that have usually been hard for plain Turing computation.


Please correct me if I'm wrong: Neural Turing Machines were a good step, but because Transformers do everything in parallel and don't have to read a tape, are they more powerful but less efficient?


1. There's nothing particularly special about Transformers, except that it was the first architecture to scale.

Transformers scale well because they leverage a GPU well. The drawback to this is the attention mechanism (RAM hungry and finite autoregressive window)

2. Transformers aren't "computers" in the Turing machine sense given their finite state. You could make that argument for models with hidden states, though, like RNNs/LSTMs, SSMs/Mamba, RWKV. The issue with those is that the hidden state makes them harder to train at scale.


> 2. Transformers aren't "computers" in the turing machine sense given their finite state

That's true, but then the things we normally think of today as "computers" also aren't, because of their finite amount of memory. I think that the author is implying that, like linearization, we can approximate transformers as being Turing-complete for a small range of operation (that's much larger for "real" computers).


Well, scaling is a big deal, since that is where a lot of the power comes from. The way transformers leverage parallel hardware by processing tokens in parallel also makes them more general than just sequence-to-sequence, maybe more like graph-to-graph, which is what has allowed them to also be used for things like vision.

There's a common sentiment that data/scale is more important than architecture, and that other architectures would perform just as well if you scaled them up (to a degree that is practical, of course), but I'm not sure that is totally true. Of course scale is vital to performance, but scaling the wrong architecture isn't going to get you there. For example, you need the ability to attend to specific tokens, and munging the sequence history into a single vector the way RNNs do is going to cap performance, which is why you see hybrid architectures like Jamba (Mamba + Transformer).

I think there is something special about Transformers - something that the developers accidentally got right! The key-based attention mechanism and the way attention heads in adjacent layers can pair up to form induction heads seems to be what makes them unreasonably effective at learning from language...

It's a shame there hasn't been much (any?) discussion of this that I'm aware of - perhaps just because the detailed architecture was more "accidental" than strategic. It seems the history is that Jacob Uszkoreit had the basic idea of language being more hierarchical than sequential, and therefore amenable to parallel processing (with a multi-layer architecture), but in initial implementations wasn't able to get the performance to beat other similar contemporary approaches such as using convolution. Noam Shazeer was apparently the wizard who took the basic idea, threw a lot of inspiration/experience (plus the kitchen sink?) at it, and was able to make it perform - apparently coming up with this specific attention mechanism. There was then an ablation process to simplify the architecture.


I have a blog post coming in a few days on this topic.

But basically, I agree, architecture doesn't really matter. The number of parameters and the training data matter.

The tradeoff transformers make is that you saturate the hardware really efficiently (so: easy to parallelize and scale), but the cost is the n^2 scaling of the attention mechanism.

This turns out to be a great tradeoff if what you're doing is scaling models to 7B+ parameters.
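To see where the n^2 comes from, a bare-bones scaled dot-product attention sketch makes the (n, n) score matrix explicit (this is just the textbook formula, not any particular model's implementation):

  import numpy as np

  def attention(Q, K, V):
      # Q, K, V: (n, d) arrays. The score matrix is (n, n), hence O(n^2)
      # memory and compute in sequence length n.
      scores = Q @ K.T / np.sqrt(Q.shape[-1])             # (n, n)
      weights = np.exp(scores - scores.max(-1, keepdims=True))
      weights /= weights.sum(-1, keepdims=True)           # softmax over keys
      return weights @ V                                  # (n, d)

  n, d = 8, 4
  Q, K, V = (np.random.randn(n, d) for _ in range(3))
  print(attention(Q, K, V).shape)  # (8, 4)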


Uszkoreit has mentioned that the global/quadratic attention was (paraphrasing) considered overkill, but the brute-force parallelism this simple approach allowed made that irrelevant.

But of course that changes when you scale up the context size enough, and the fix is simple since the key insight of the architecture was the hierarchical tree-like nature of language and thus dependence mostly on local (within-branch) context, not global context. In Google's Big Bird attention (from their ELMo/BERT Muppet era!) they basically use sliding-window local attention, but augmented with a fixed number of global tokens with global attention, and some random attention to further-back non-local tokens. This mixture of attention patterns performs almost as well as global attention. Part of the reason (aside from the mostly local nature of language) is that as you ascend the transformer-layer hierarchy, receptive field sizes increase (same as they do in a CNN), so even with the attention gaps of random attention, there is still visibility at higher layers.
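A rough sketch of building such a mixed attention mask (sliding-window local attention plus a few global tokens plus random links), following the pattern described above; the window size and counts are arbitrary:

  import numpy as np

  def mixed_attention_mask(n, window=2, n_global=1, n_random=2, seed=0):
      # Boolean (n, n) mask: True where token i may attend to token j.
      rng = np.random.default_rng(seed)
      idx = np.arange(n)
      # Local sliding window: each token sees neighbours within `window`.
      mask = np.abs(idx[:, None] - idx[None, :]) <= window
      # Global tokens: the first n_global tokens see and are seen by all.
      mask[:n_global, :] = True
      mask[:, :n_global] = True
      # Random links: each token additionally attends to a few random positions.
      for i in range(n):
          mask[i, rng.choice(n, size=n_random, replace=False)] = True
      return mask

  print(mixed_attention_mask(8).astype(int))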


> There's nothing particularly special about Transformers, except that it was the first architecture to scale

This seems like a problematic framing. Is that not the definition of particularly special?


Only in the sense of "has an attribute others we tried before didn't have". Not "uniquely possesses that attribute".


That doesn’t make it any less important or remarkable though?

I guess I’m struggling to understand the push to trivialize it.

Transformers came onto the scene and the entire space exploded. Setting aside any technical/theoretical lack of uniqueness, it’s hard to ignore the real world results.


>That doesn’t make it any less important or remarkable though?

In the historical sense?

Because in the practical sense, if it's just the first to have this attribute, but we find others with the same attribute (as we apparently did), then it's not really important or remarkable anymore.


Transformers are special because theoretically you can't make something "weaker" than a transformer without it losing some accuracy: https://arxiv.org/abs/2209.04881 .


> Despite a remarkable amount of algorithmic effort on Boolean satisfiability (SAT) and related problems, to date no one has invented an algorithm with faster-than-exponential (O(2^n)) running time; indeed, there is no polynomial-time algorithm for SAT unless P = NP.

  [1] https://www.hindawi.com/journals/complexity/2018/7982851/
  [2] http://www.cs.cornell.edu/~sabhar/publications/learnIJCAI03.pdf
[1] shows how to replace a numerical problem-solving process with a SAT (circuit) based one and obtain an exponential speedup.

Let me quote the [2]: "We also show that without restarts but with a new learning scheme, clause learning can provide exponentially smaller proofs than regular resolution, which itself is known to be much stronger than ordinary DPLL."

So, in my opinion, there are algorithms that work in O(2^(n(1-e))) time.


Yes, Transformers are probably close to the Pareto frontier in terms of efficiency if you want to train something that has a sequence as an input.

But they're not inherently *special*. There's a bunch of other model types around that Pareto frontier. Transformers are just good at saturating memory bandwidth, which is the hardware frontier.


I think of an LLM as a non-von-Neumann architecture that basically can compute neural network stuff in parallel. You throw a bunch of training data at it, making it do predictions, and it shifts the gears and finds the model that gets close to modeling the latent space. That's it.

Alan Turing did it with the Bombe machine modeling the Enigma, didn't he?



