Async programming is great. Coroutines are a powerful tool, both for expressing your ideas more clearly and for improving performance in IO-heavy systems.
async/await syntax may not be the best design for async programming, though. Consider an example in Julia:
    function foo(x)
        @async print(x)   # some IO function; @async returns a Task
    end

    function bar(x)
        wait(foo(x))      # wait for the Task returned by foo
    end
`foo()` returns an asynchronous `Task`, `bar()` awaits this task, and you can invoke `bar()` from whatever context you want. Now look at the Python version with async/await keywords:
    async def foo(x):
        print(x)  # some IO function

    def bar(x):
        await foo(x)
    # SyntaxError: 'await' outside async function
Oops, we can't make `bar()` synchronous, it MUST be `async` now, as well as all functions that invoke `bar()`. This is what is meant by "infectious" behavior.
Maybe we can wrap it in `asyncio.run()` then and stop the async avalanche?
    import asyncio

    def bar(x):
        asyncio.run(foo(x))

    bar(5)
Yes, it works in a synchronous context. But the path to an asynchronous context is now closed to us:
    async def baz(x):
        bar(x)

    await baz(5)
    # RuntimeError: asyncio.run() cannot be called from a running event loop
So in practice, whenever you change one of your functions to `async`, you have to change all its callers up the stack to also be `async`. And it hurts a lot.
Can we have asynchronous programming in Python without async/await? Well, prior to Python 3.5 we used generators, so it looks like it's at least technically possible.
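Here is a minimal sketch of that generator style: plain generators plus a toy round-robin scheduler (not the real pre-3.5 asyncio API, just an illustration of the idea):

    def foo(x):
        yield          # yield control to the scheduler instead of awaiting
        print(x)       # some IO function

    def run(tasks):
        # toy round-robin scheduler: resume each generator until all finish
        while tasks:
            task = tasks.pop(0)
            try:
                next(task)
                tasks.append(task)
            except StopIteration:
                pass

    run([foo(1), foo(2)])

Ordinary functions can create, pass around, and drive these generators without any special syntax; the price is that every suspension point has to be an explicit `yield`.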
Python actually does something pretty neat: it chains comparisons, so that `x < y <= z` is like `x < y and y <= z`, except that `y` is only evaluated once.
In the linked code we can be confident that `0 <= 1`, so only `1 < ndim` should matter. In fact, I'd expect the peephole optimizer to remove most of the code for `0 <= 1`.
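A quick illustration of the single-evaluation part (the `side_effect()` helper is made up just to make the call visible):

    def side_effect():
        print("evaluated once")
        return 1

    # chained form: side_effect() is called a single time
    print(0 <= side_effect() < 5)   # prints "evaluated once", then True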
So is it the case that the information is in the data set? Or is the code so well designed that it can be this small? As an outsider, it's surprising that such a capable model can be so "simple".
The training code is presumably quite a bit more complex than what they've open sourced, but part of the beauty of the GPT-based LLMs is their structural simplicity.
Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so", if you will.
If you look at the source code to nanoGPT and compare it to Llama3, the most remarkable thing (when you look past the superficial name changes) is just how similar they are.
If I recall correctly the primary differences are:
- The MLP: Llama3 uses SwiGLU vs the more "traditional" x = x + proj(gelu(expand(x))) in GPT2 (see the sketch after this list)
- The token encoders, which are arguably external to the model
- Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
- Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
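A rough PyTorch sketch of the MLP difference from the first bullet (dimension choices and class names here are illustrative, not either repo's exact code):

    import torch.nn as nn
    import torch.nn.functional as F

    class GPT2MLP(nn.Module):              # expand 4x, GELU, project back
        def __init__(self, d):
            super().__init__()
            self.expand = nn.Linear(d, 4 * d)
            self.proj = nn.Linear(4 * d, d)
        def forward(self, x):
            return self.proj(F.gelu(self.expand(x)))

    class SwiGLUMLP(nn.Module):            # gated hidden layer with SiLU, Llama-style
        def __init__(self, d, hidden):     # hidden is roughly (8/3) * d in Llama
            super().__init__()
            self.w1 = nn.Linear(d, hidden, bias=False)   # gate projection
            self.w3 = nn.Linear(d, hidden, bias=False)   # up projection
            self.w2 = nn.Linear(hidden, d, bias=False)   # down projection
        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.w3(x))

The residual connection `x = x + ...` sits outside the block in both cases.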
They were published more than five years apart. On the one hand progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model.
Goes to show just how much is in the training data.
I think with LLMs in general, the algorithms are very refined and require lots of research, despite being "simple" in terms of entropy, or an imagined Kolgomorov complexity for defining algorithms.
So "simple" is a fuzzy term here, but yes, the entropic complexity is in the data, not the algorithms.
Related to the so-called "Bitter lesson".
Edit: the sister comment pointed out what I failed to express: RLHF and training are also algorithms, and their applications and implementations are probably much more complex than the code that evaluates a given prompt.
So basically, "models" (trained NNs) are also an example of the equivalence of code and data.
Fixed data used by code (the trained model) is code in itself, even when it is not directly written by humans or in a human-readable language.
Edit edit: don't forget to count the imported maths code :)
but I assume this is not relevant to the "it's just matrix multiplications" overall argument
300 lines of this code is a bit different from 300 lines of typical code where you read files, set up a backend/frontend, or parse data. In the latter case, there are a lot of tedious operations. Sure, the former also has some of that, with reshaping and asserts and whatnot.
But in a sense, the 300 lines of Llama code are essentially just lines of math. And reading through any math proof will show you that any particular line can hide large amounts of complexity.
This can be true with code with more tedious operations, but those lines are a smaller fraction of the overall code base by definition.
Even the "tedious" parts of the llama code can hide large complexity. Setting a learning rate with a schedule might require reading a paper or two for your particular architecture.
But yes, once you parse all the math and the theory, the lines are kinda simple matmul and forward lol.
Sure, knowing the basics of LLM math is necessary. But it's also _enough_ to know this math to fully grasp the code. There are only 4 concepts - attention, feed-forward net, RMS-normalization and rotary embeddings - organized into a clear structure.
Now compare it to the Hugging Face implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of the attention mask (which is not the same as the mask you read about in transformer tutorials), several types of cache class, dozens of flags to control things like output format or serialization, etc.
It's not that Meta's implementation is good and HF's implementation is bad - they pursue different goals in their own optimal way. But if you just want to learn how the model works, Meta's code base is great.
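As an example of how little each of those four concepts amounts to, RMS-normalization is essentially this (a sketch in the spirit of Meta's code; exact details may differ):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        # scale by the reciprocal root-mean-square; unlike LayerNorm, no mean subtraction
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))
        def forward(self, x):
            rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return self.weight * (x * rms)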
That's just the default. You can set max_seq_len to 8k. From the readme [0]:
> All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
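If I remember the example scripts in the repo correctly, these values are passed when the model is built, roughly like this (the paths and sizes below are placeholders):

    from llama import Llama

    generator = Llama.build(
        ckpt_dir="Meta-Llama-3-8B/",        # placeholder path
        tokenizer_path="tokenizer.model",   # placeholder path
        max_seq_len=8192,     # pre-allocates the KV cache for 8k tokens
        max_batch_size=1,     # keep small to fit your hardware
    )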
The simplicity of the transformer is quite refreshing, especially in vision, where the Vision Transformer with linear patch encodings replaces complex, intertwined decisions about filter size, striding, pooling, #filters, depth, etc. with the simpler decision of how to allocate your FLOPS between dimensionality, #heads, and #layers.
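For concreteness, the "linear patch encoding" is just one strided projection; a sketch with typical ViT-Base numbers:

    import torch
    import torch.nn as nn

    patch, dim = 16, 768
    # one linear map per 16x16 patch; a Conv2d with kernel_size == stride is exactly that
    to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    img = torch.randn(1, 3, 224, 224)
    tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768): one token per patch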
JAX requires a bit more work to maintain fixed-size buffers as required by XLA, especially in case of caching and rotary embeddings. But yeah, overall the code can be pretty similar [1].
I have quite limited experience with Cython and have tried Numba just a couple of times, but I'm curious how much effort it would take to rewrite one of my Julia libraries with them.
The library is for reverse-mode automatic differentiation, but let's put AD itself aside and talk about code generation. As an input to the code generator, I have a computational graph (or "tape") - a list of functions connecting input and intermediate variables. As an output I want a compiled function for CPU/GPU. (Note: Theano used to do exactly this, but that's a separate huge system not relevant to Numba or Cython.)
In Julia I take the following steps:
1. Convert all operations on the tape to expressions (approx. 1 line of code per operation type).
2. Eliminate common subexpressions.
3. Fuse broadcasting, e.g. rewrite:
    c .= a .* b
    e .= c .+ d
into
    e .= a .* b .+ d
A dot near an operation means that it is applied elementwise without creating intermediate arrays. On CPU, the Julia compiler / LLVM then generates code that reads and writes memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays). On GPU, CUDAnative generates a single CUDA kernel, which in my tests is ~1.5 times faster than several separate kernels. Note that `.=` also means that the result of the operation is written directly to a (buffered) destination, so no memory is allocated in the hot loop.
4. Rewrite everything I can into in-place operations. Notably, matrix multiplication `A * B` is replaced with a BLAS/CUBLAS alternative.
5. Add a function header and buffers to the expression, and JIT-compile the result.
In Python, I imagine using the `ast` module for code parsing and transformations like common subexpression elimination (how hard would it be?). Perhaps Numba can be used to compile Python code to fast CPU and GPU code, but does it fit into an AST-based workflow? Also, do Numba or Cython do optimizations like broadcast and kernel fusion? I'd love to see a side-by-side comparison of capabilities in such a scenario!
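As a point of comparison, here is roughly what the fused broadcast from step 3 could look like with Numba (a hand-written sketch: it avoids materializing the intermediate `c`, but it is not generated from an AST, and whether Numba would produce it automatically is exactly the open question):

    import numpy as np
    from numba import njit

    @njit
    def fused(a, b, d, e):
        # one loop reads a, b, d and writes e exactly once; no temporary for a * b
        for i in range(e.size):
            e[i] = a[i] * b[i] + d[i]

    a, b, d = np.random.rand(3, 1_000_000)
    e = np.empty_like(a)
    fused(a, b, d, e)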
there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47). That said, for many tasks with wide vector data, a DL framework is going to do ok, e.g. PyTorch.
> Julia compiler / LLVM then generates code that reads and writes to memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays)
Numba's gufuncs address exactly this + broadcasting over arbitrary input shapes. I've used this extensively. That said, I don't find fusing broadcasting is always a win, especially when arrays exceed cache size. Numba's CUDA support will also fuse jit functions into a single kernel, or generate device functions.
Sometimes you want manual control over kernel fusion, and I've found Loopy (https://documen.tician.de/loopy/) to be fairly flexible in this regard, but it's a completely different approach compared to Numba/Julia.
I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
> there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47).
I believe it's more complicated than most posters there realize, especially in the context of PyTorch (which uses a fork of autograd under the hood) with its dynamic graphs... Anyway, AD deserves its own discussion, which is why I didn't want to concentrate on it.
> I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
Sounds good! Do you have a task at hand that would involve all the topics and could be implemented in limited time? Maybe some kind of Monte Carlo simulation or Gibbs sampling to get started?
So we have ~6k rps on a single CPU core. As far as I remember, Tornado does ~4k rps, while the built-in Flask server can only process about 1k requests per second. Yes, you are unlikely to use the Flask dev server in production, but for aiohttp this is indeed a recommended way.
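For reference, a minimal aiohttp handler of the kind being measured looks roughly like this (a sketch, not necessarily the exact benchmarked code):

    from aiohttp import web

    async def hello(request):
        return web.Response(text="Hello")

    app = web.Application()
    app.add_routes([web.get("/", hello)])
    web.run_app(app)   # single process, single event loop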
Now let's measure Julia's HTTP.jl server:
    using HTTP

    HTTP.listen() do request::HTTP.Request
        return HTTP.Response("Hello")
    end
This doesn't include any routing, input data parsing, header or cookie processing, etc., but it amazes me how good the server is given that web development is NOT considered a strong part of the language.
The downside of Julia web programming is the number of libraries and tools (e.g. routers, DB connectors, template engines, etc.) - they exist, but are quite a bit behind their Python equivalents, so gotchas are to be expected. Yet I'm quite positive about the future of web programming in Julia.
Some time ago I implemented a library [1] similar to this tool. The tricky part is that derivatives quickly exceed 2 dimensions, e.g. the derivative of a vector output w.r.t. an input matrix is a 3D tensor (e.g. if `y = f(X)`, you need to find the derivative of each `y[i]` w.r.t. each `X[m,n]`), and we don't have a notation for it. Also, such tensors are often very sparse (e.g. for element-wise `log()` the derivative is a matrix where only the main diagonal has non-zero values, corresponding to derivatives `dy[i]/dx[i]` where `y = log(x)`).
The way I dealt with it is to first translate vectorized expression to so-called Einstein notation [2] - indexed expression with implicit sums over repeated indices. E.g. matrix product `Z = X * Y` may be written in it as:
    Z[i,j] = X[i,k] * Y[k,j]   # implicitly sum over k
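For reference, numpy's `einsum` takes essentially this notation, so the correspondence is easy to check:

    import numpy as np

    X, Y = np.random.rand(3, 4), np.random.rand(4, 5)
    # Z[i,j] = X[i,k] * Y[k,j], summing over the repeated index k
    Z = np.einsum('ik,kj->ij', X, Y)
    assert np.allclose(Z, X @ Y)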
It worked pretty well and I was able to get results in Einstein notation for element-wise functions, matrix multiplication and even convolutions.
Unfortunately, the only way to calculate such expressions efficiently is to convert them back to vectorized notation, and it's not always possible (e.g. because of sparse structure) and very error-prone.
The good news is that if the result of the whole expression is a scalar, all the derivatives will have the same number of dimensions as corresponding inputs. E.g. in:
    y = sum(W * X + b)
if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be a 3D tensor). This is the reason why the backpropagation algorithm (and symbolic/automatic differentiation in general) works in machine learning. So finally I ended up with another library [3], which can only deal with scalar outputs, but is much more stable.
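This is easy to sanity-check with any AD tool; for instance, a quick PyTorch check (shapes are arbitrary):

    import torch

    W = torch.randn(3, 4, requires_grad=True)
    X = torch.randn(4, 5)
    b = torch.randn(3, 1)

    y = (W @ X + b).sum()            # scalar output
    y.backward()
    assert W.grad.shape == W.shape   # dy/dW has the same shape as W itself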
A theoretical description of the method behind the first library can be found in [4] (pages 1338-1343; caution - a 76MB file), while the set of rules I've derived is in [5].
If I want to export my own computational graph to ONNX, what is the first place I should look at? Do you know about any documentation or reference implementation of the format?
> I don’t understand their vendetta against the graph, which is a powerful abstraction that lets you choose different backend [...]
You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
> [...] and let’s tensorboard show you an awesome view of your computation
At the end of the post the author shows his API that lets you do the same things as Tensorboard, but for whatever framework you like.
All in all, expression graphs like those used in TF and Theano are great for symbolic differentiation of a loss function and further expression optimization (e.g. simplification, operation fusion, etc.). But TF goes further and makes everything a node in a graph, even things that are not algebraic expressions, such as variable initialization or objective optimization.
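To make the array-implementation point concrete, here is a sketch using numpy/cupy duck typing (the softmax and the particular library pair are just an illustration):

    import numpy as np
    try:
        import cupy as cp          # optional GPU backend
    except ImportError:
        cp = None

    def softmax(x):
        # same code runs on CPU or GPU: dispatch on the array's own module
        xp = cp.get_array_module(x) if cp is not None else np
        e = xp.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)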
> You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
And now you can (waves arms) write it twice! Alternatively, you can make the interfaces between various impls be exactly the same but rename them so they're purpose-named. Then you've written Graph, for the most part.
Except your graph is symbolic, and good luck getting a breakpoint to fire when the calculation is happening ... Or if you don't like debugging, the problem manifests itself with merely printing values too.
BLAS really shines when you do matrix multiplication; for element-wise operations the best you can do is to add the numbers using SIMD instructions or put the load on the GPU, and most numeric libraries already do this when possible. The benchmark above seems unrealistic; here are the results from my newest MacBook Pro:
In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
No surprise, Julia gives nearly the same result:
julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD. Just noticed the 7.45 GiB of allocations. We can get rid of them like this:
julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes)
or:
julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
I may not have noticed the use of swap in the previous test, so I repeated it on a Linux box with 1e8 numbers (instead of 1e9). Julia took 100.583 ms while Python took 207 ms (probably due to reading the array twice). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.
I think the benchmark was probably done on a supercomputer. But that's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and I was impressed, but I stopped following its progress. Might have to keep it on my radar!
Even better would be to write (see dot vectorization [1]):
    C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:
julia> using BenchmarkTools
julia> @benchmark C = A + B
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 2.359 ms (0.00% GC)
median time: 2.713 ms (0.00% GC)
mean time: 3.794 ms (28.81% GC)
maximum time: 62.708 ms (95.27% GC)
--------------
samples: 1314
evals/sample: 1
julia> @benchmark C .= A .+ B
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 1.232 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.356 ms (0.00% GC)
maximum time: 2.572 ms (0.00% GC)
--------------
samples: 3651
evals/sample: 1
Note that memory usage dropped from 7.63MiB to 128 bytes.
A little bit late to the party here, but the real allocation is just 0 bytes. It shows 128 bytes because the benchmark is creating new references to A, B and C. To correct this, either use interpolation with $A, $B and $C or initialize A, B and C in the setup phase:
> @benchmark C .= A .+ B setup = (A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.048 ms (0.00% GC)