Async programming is great. Coroutines are a powerful tool, both for expressing your ideas more clearly and for improving performance in IO-heavy systems.
async/await syntax may not be the best design for async programming, though. Consider an example in Julia:
    function foo(x)
        @async print(x)   # some IO function; @async returns a Task
    end

    function bar(x)
        wait(foo(x))      # wait for the Task returned by foo
    end
`foo()` returns an asynchronous `Task`, `bar()` awaits this task, and you can invoke `bar()` from whatever context you want. Now look at the Python version with async/await keywords:
    async def foo(x):
        print(x)  # some IO function

    def bar(x):
        await foo(x)
    # SyntaxError: 'await' outside async function
Oops, we can't make `bar()` synchronous, it MUST be `async` now, as well as all functions that invoke `bar()`. This is what is meant by "infectious" behavior.
Maybe we can wrap it in `asyncio.run()` then and stop the async avalanche?
    import asyncio

    def bar(x):
        asyncio.run(foo(x))

    bar(5)
Yes, it works in a synchronous context. But the path to an asynchronous context is now closed to us:
    async def baz(x):
        bar(x)

    await baz(5)
    # RuntimeError: asyncio.run() cannot be called from a running event loop
So in practice, whenever you change one of your functions to `async`, you have to change all its callers up the stack to also be `async`. And it hurts a lot.
Can we have asynchronous programming in Python without async/await? Well, prior to Python 3.5 we used generators, so it looks like it's at least technically possible.
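Here is a minimal sketch of that generator style: plain generators plus a toy round-robin scheduler (not the real pre-3.5 asyncio API, just an illustration of the idea):

    def foo(x):
        yield          # yield control to the scheduler instead of awaiting
        print(x)       # some IO function

    def run(tasks):
        # toy round-robin scheduler: resume each generator until all finish
        while tasks:
            task = tasks.pop(0)
            try:
                next(task)
                tasks.append(task)
            except StopIteration:
                pass

    run([foo(1), foo(2)])

Ordinary functions can create, pass around, and drive these generators without any special syntax; the price is that every suspension point has to be an explicit `yield`.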
Python actually does something pretty neat: it chains comparisons, so that `x < y <= z` is like `x < y and y <= z`, except that `y` is only evaluated once.
In the linked code we can be confident that `0 <= 1`, so only `1 < ndim` should matter. In fact, I'd expect the peephole optimizer to remove most of the code for `0 <= 1`.
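A quick illustration of the single-evaluation part (the `side_effect()` helper is made up just to make the call visible):

    def side_effect():
        print("evaluated once")
        return 1

    # chained form: side_effect() is called a single time
    print(0 <= side_effect() < 5)   # prints "evaluated once", then True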
So is it the case that the information is in the data set? Or is the code so well designed that it can be this small? As an outsider, it's surprising that such a capable model can be so "simple".
The training code is presumably quite a bit more complex than what they've open sourced, but part of the beauty of the GPT-based LLMs is their structural simplicity.
Now, that simplicity can be deceiving - there is a lot of conceptual interconnectedness within these models. They've been put together "just so", if you will.
If you look at the source code to nanoGPT and compare it to Llama3, the most remarkable thing (when you look past the superficial name changes) is just how similar they are.
If I recall correctly the primary differences are:
- The MLP: Llama3 uses SwiGLU vs the more "traditional" x = x + proj(gelu(expand(x))) in GPT2 (see the sketch after this list)
- The token encoders, which are arguably external to the model
- Attention: Llama3 uses Grouped Query Attention, vs full Multi-Head Attention in GPT2
- Normalization: Llama3 uses RMSNorm, vs LayerNorm for GPT2
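A rough PyTorch sketch of the MLP difference from the first bullet (dimension choices and class names here are illustrative, not either repo's exact code):

    import torch.nn as nn
    import torch.nn.functional as F

    class GPT2MLP(nn.Module):              # expand 4x, GELU, project back
        def __init__(self, d):
            super().__init__()
            self.expand = nn.Linear(d, 4 * d)
            self.proj = nn.Linear(4 * d, d)
        def forward(self, x):
            return self.proj(F.gelu(self.expand(x)))

    class SwiGLUMLP(nn.Module):            # gated hidden layer with SiLU, Llama-style
        def __init__(self, d, hidden):     # hidden is roughly (8/3) * d in Llama
            super().__init__()
            self.w1 = nn.Linear(d, hidden, bias=False)   # gate projection
            self.w3 = nn.Linear(d, hidden, bias=False)   # up projection
            self.w2 = nn.Linear(hidden, d, bias=False)   # down projection
        def forward(self, x):
            return self.w2(F.silu(self.w1(x)) * self.w3(x))

The residual connection `x = x + ...` sits outside the block in both cases.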
They were published more than five years apart. On the one hand progress has been breathtaking, truly astounding. On the other hand, it's almost exactly the same model.
Goes to show just how much is in the training data.
I think with LLMs in general, the algorithms are very refined and require lots of research, despite being "simple" in terms of entropy, or an imagined Kolgomorov complexity for defining algorithms.
So "simple" is a fuzzy term here, but yes, the entropic complexity is in the data, not the algorithms.
Related to the so-called "Bitter lesson".
Edit: the sister comment pointed out what I failed to express: RLHF and training are also algorithms, and their applications and implementations are probably much more complex than the code that evaluates a given prompt.
So basically, "models" (trained NNs) are also an example of the equivalence of code and data.
Fixed data used by code (the trained model) is code in itself, even when it is not directly written by humans or in a human-readable language.
Edit edit: don't forget to count the imported maths code :)
but I assume this is not relevant to the "it's just matrix multiplications" overall argument
300 lines of this code is a bit different from 300 lines of typical code where you read files, set up a backend/frontend, or parse data. In the latter case, there are a lot of tedious operations. Sure, the former also has some of that, with reshaping and asserts and whatnot.
But in a sense, the 300 lines of Llama code are essentially just lines of math. And reading through any math proof will show you that any particular line can hide large amounts of complexity.
This can be true with code with more tedious operations, but those lines are a smaller fraction of the overall code base by definition.
Even the "tedious" parts of the llama code can hide large complexity. Setting a learning rate with a schedule might require reading a paper or two for your particular architecture.
But yes, once you parse all the math and the theory, the lines are kinda simple matmul and forward lol.
Sure, knowing the basics of LLM math is necessary. But it's also _enough_ to know this math to fully grasp the code. There are only 4 concepts - attention, feed-forward net, RMS-normalization and rotary embeddings - organized into a clear structure.
Now compare it to the Hugging Face implementation [1]. In addition to the aforementioned concepts, you need to understand the hierarchy of `PreTrainedModel`s, 3 types of attention, 3 types of rotary embeddings, HF's definition of the attention mask (which is not the same as the mask you read about in transformer tutorials), several types of cache class, dozens of flags to control things like output format or serialization, etc.
It's not that Meta's implementation is good and HF's implementation is bad - they pursue different goals in their own optimal way. But if you just want to learn how the model works, Meta's code base is great.
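As an example of how little each of those four concepts amounts to, RMS-normalization is essentially this (a sketch in the spirit of Meta's code; exact details may differ):

    import torch
    import torch.nn as nn

    class RMSNorm(nn.Module):
        # scale by the reciprocal root-mean-square; unlike LayerNorm, no mean subtraction
        def __init__(self, dim, eps=1e-6):
            super().__init__()
            self.eps = eps
            self.weight = nn.Parameter(torch.ones(dim))
        def forward(self, x):
            rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
            return self.weight * (x * rms)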
That's just the default. You can set max_seq_len to 8k. From the readme [0]:
> All models support sequence length up to 8192 tokens, but we pre-allocate the cache according to max_seq_len and max_batch_size values. So set those according to your hardware.
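If I remember the example scripts in the repo correctly, these values are passed when the model is built, roughly like this (the paths and sizes below are placeholders):

    from llama import Llama

    generator = Llama.build(
        ckpt_dir="Meta-Llama-3-8B/",        # placeholder path
        tokenizer_path="tokenizer.model",   # placeholder path
        max_seq_len=8192,     # pre-allocates the KV cache for 8k tokens
        max_batch_size=1,     # keep small to fit your hardware
    )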
The simplicity of the transformer is quite refreshing, especially in vision, where the Vision Transformer with linear patch encodings replaces complex, intertwined decisions about filter size, striding, pooling, #filters, depth, etc. with the simpler decision of how to allocate your FLOPS between dimensionality, #heads, and #layers.
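For concreteness, the "linear patch encoding" is just one strided projection; a sketch with typical ViT-Base numbers:

    import torch
    import torch.nn as nn

    patch, dim = 16, 768
    # one linear map per 16x16 patch; a Conv2d with kernel_size == stride is exactly that
    to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    img = torch.randn(1, 3, 224, 224)
    tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, 768): one token per patch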
JAX requires a bit more work to maintain fixed-size buffers as required by XLA, especially in case of caching and rotary embeddings. But yeah, overall the code can be pretty similar [1].
I have quite limited experience with Cython and have tried Numba just a couple of times, but I'm curious how much effort it would take to rewrite one of my Julia libraries with them.
The library is for reverse-mode automatic differentiation, but let's put AD itself aside and talk about code generation. As an input to the code generator, I have a computational graph (or "tape") - a list of functions connecting input and intermediate variables. As an output I want a compiled function for CPU/GPU. (Note: Theano used to do exactly this, but that's a separate huge system not relevant to Numba or Cython.)
In Julia I take the following steps:
1. Convert all operations on the tape to expressions (approx. 1 line of code per operation type).
2. Eliminate common subexpressions.
3. Fuse broadcasting, e.g. rewrite:
    c .= a .* b
    e .= c .+ d
into
    e .= a .* b .+ d
A dot near an operation means that it is applied elementwise without creating intermediate arrays. On CPU, the Julia compiler / LLVM then generates code that reads and writes memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays). On GPU, CUDAnative generates a single CUDA kernel, which in my tests is ~1.5 times faster than several separate kernels. Note that `.=` also means that the result of the operation is written directly to a (buffered) destination, so no memory is allocated in the hot loop.
4. Rewrite everything I can into in-place operations. Notably, matrix multiplication `A * B` is replaced with a BLAS/CUBLAS alternative.
5. Add a function header and buffers to the expression, and JIT-compile the result.
In Python, I imagine using the `ast` module for code parsing and transformations like common subexpression elimination (how hard would it be?). Perhaps Numba can be used to compile Python code to fast CPU and GPU code, but does it fit into an AST-based workflow? Also, do Numba or Cython do optimizations like broadcast and kernel fusion? I'd love to see a side-by-side comparison of capabilities in such a scenario!
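As a point of comparison, here is roughly what the fused broadcast from step 3 could look like with Numba (a hand-written sketch: it avoids materializing the intermediate `c`, but it is not generated from an AST, and whether Numba would produce it automatically is exactly the open question):

    import numpy as np
    from numba import njit

    @njit
    def fused(a, b, d, e):
        # one loop reads a, b, d and writes e exactly once; no temporary for a * b
        for i in range(e.size):
            e[i] = a[i] * b[i] + d[i]

    a, b, d = np.random.rand(3, 1_000_000)
    e = np.empty_like(a)
    fused(a, b, d, e)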
there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47). That said, for many tasks with wide vector data, a DL framework is going to do ok, e.g. PyTorch.
> Julia compiler / LLVM then generates code that reads and writes to memory exactly once (unlike e.g. what you would get with several separate operations on numpy arrays)
Numba's gufuncs address exactly this + broadcasting over arbitrary input shapes. I've used this extensively. That said, I don't find fusing broadcasting is always a win, especially when arrays exceed cache size. Numba's CUDA support will also fuse jit functions into a single kernel, or generate device functions.
Sometimes you want manual control over kernel fusion, and I've found Loopy (https://documen.tician.de/loopy/) to be fairly flexible in this regard, but it's a completely different approach compared to Numba/Julia.
I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
> there's nothing in the language that prevents this from working with the autograd package, except no one's taken the time to implement it (https://github.com/HIPS/autograd/issues/47).
I believe it's more complicated than most posters there realize, especially in the context of PyTorch (which uses a fork of autograd under the hood) with its dynamic graphs... Anyway, AD deserves its own discussion, which is why I didn't want to concentrate on it.
> I'd be interested in a side by side comparison as well, and I was thinking that the main difficulty would be that I couldn't write good Julia code, but maybe we can pair up, if that'd be interesting, to address several common topics that come up (fusion, broadcasting, generics but specialization, etc).
Sounds good! Do you have a task at hand that would involve all the topics and could be implemented in limited time? Maybe some kind of Monte Carlo simulation or Gibbs sampling to get started?
So we have ~6k rps on a single CPU core. As far as I remember, Tornado does ~4k rps, while the built-in Flask server can only process about 1k requests per second. Yes, you are unlikely to use the Flask dev server in production, but for aiohttp this is indeed a recommended way.
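For reference, a minimal aiohttp handler of the kind being measured looks roughly like this (a sketch, not necessarily the exact benchmarked code):

    from aiohttp import web

    async def hello(request):
        return web.Response(text="Hello")

    app = web.Application()
    app.add_routes([web.get("/", hello)])
    web.run_app(app)   # single process, single event loop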
Now let's measure Julia's HTTP.jl server:
    using HTTP

    HTTP.listen() do request::HTTP.Request
        return HTTP.Response("Hello")
    end
This doesn't include any routing, input data parsing, header or cookie processing, etc., but it amazes me how good the server is given that web development is NOT considered a strong part of the language.
The downside of Julia web programming is the number of libraries and tools (e.g. routers, DB connectors, template engines, etc.) - they exist, but are quite a bit behind their Python equivalents, so gotchas are to be expected. Yet I'm quite positive about the future of web programming in Julia.
Some time ago I implemented a library [1] similar to this tool. The tricky part is that derivatives quickly exceed 2 dimensions, e.g. the derivative of a vector output w.r.t. an input matrix is a 3D tensor (e.g. if `y = f(X)`, you need to find the derivative of each `y[i]` w.r.t. each `X[m,n]`), and we don't have a notation for it. Also, such tensors are often very sparse (e.g. for element-wise `log()` the derivative is a matrix where only the main diagonal has non-zero values, corresponding to derivatives `dy[i]/dx[i]` where `y = log(x)`).
The way I dealt with it is to first translate vectorized expression to so-called Einstein notation [2] - indexed expression with implicit sums over repeated indices. E.g. matrix product `Z = X * Y` may be written in it as:
    Z[i,j] = X[i,k] * Y[k,j]   # implicitly sum over k
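For reference, numpy's `einsum` takes essentially this notation, so the correspondence is easy to check:

    import numpy as np

    X, Y = np.random.rand(3, 4), np.random.rand(4, 5)
    # Z[i,j] = X[i,k] * Y[k,j], summing over the repeated index k
    Z = np.einsum('ik,kj->ij', X, Y)
    assert np.allclose(Z, X @ Y)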
It worked pretty well and I was able to get results in Einstein notation for element-wise functions, matrix multiplication and even convolutions.
Unfortunately, the only way to calculate such expressions efficiently is to convert them back to vectorized notation, and it's not always possible (e.g. because of sparse structure) and very error-prone.
The good news is that if the result of the whole expression is a scalar, all the derivatives will have the same number of dimensions as corresponding inputs. E.g. in:
    y = sum(W * X + b)
if `W` is a matrix, then `dy/dW` is also a matrix (without the sum it would be a 3D tensor). This is the reason why the backpropagation algorithm (and symbolic/automatic differentiation in general) works in machine learning. So finally I ended up with another library [3], which can only deal with scalar outputs, but is much more stable.
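This is easy to sanity-check with any AD tool; for instance, a quick PyTorch check (shapes are arbitrary):

    import torch

    W = torch.randn(3, 4, requires_grad=True)
    X = torch.randn(4, 5)
    b = torch.randn(3, 1)

    y = (W @ X + b).sum()            # scalar output
    y.backward()
    assert W.grad.shape == W.shape   # dy/dW has the same shape as W itself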
A theoretical description of the method behind the first library can be found in [4] (pages 1338-1343; caution - a 76MB file), while the set of rules I've derived is in [5].
If I want to export my own computational graph to ONNX, what is the first place I should look at? Do you know about any documentation or reference implementation of the format?
> I don’t understand their vendetta against the graph, which is a powerful abstraction that lets you choose different backend [...]
You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
> [...] and let’s tensorboard show you an awesome view of your computation
At the end of the post the author shows his API that lets you do the same things as Tensorboard, but for whatever framework you like.
All in all, expression graphs like those used in TF and Theano are great for symbolic differentiation of a loss function and further expression optimization (e.g. simplification, operation fusion, etc.). But TF goes further and makes everything a node in a graph, even things that are not algebraic expressions, such as variable initialization or objective optimization.
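To make the array-implementation point concrete, here is a sketch using numpy/cupy duck typing (the softmax and the particular library pair are just an illustration):

    import numpy as np
    try:
        import cupy as cp          # optional GPU backend
    except ImportError:
        cp = None

    def softmax(x):
        # same code runs on CPU or GPU: dispatch on the array's own module
        xp = cp.get_array_module(x) if cp is not None else np
        e = xp.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)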
> You don't really need a graph to support different backends. One popular approach is to have different array implementations (e.g. CPU and GPU arrays).
And now you can (waves arms) write it twice! Alternatively, you can make the interfaces between various impls be exactly the same but rename them so they're purpose-named. Then you've written Graph, for the most part.
Except your graph is symbolic, and good luck getting a breakpoint to fire when the calculation is happening ... Or if you don't like debugging, the problem manifests itself with merely printing values too.
BLAS really shines when you do matrix multiplication; for element-wise operations the best you can do is to add the numbers using SIMD instructions or put the load on the GPU, and most numeric libraries already do this when possible. The benchmark above seems unrealistic; here are the results from my newest MacBook Pro:
In [2]: import numpy as np
In [3]: X = np.ones(1000000000, dtype=np.int)
In [4]: Y = np.ones(1000000000, dtype=np.int)
In [5]: %time X = X + 2.0 * Y
CPU times: user 10.4 s, sys: 27.1 s, total: 37.5 s
Wall time: 46 s
In [6]: %time X = X + 2 * Y
CPU times: user 8.66 s, sys: 26 s, total: 34.7 s
Wall time: 42.6 s
In [7]: %time X += 2 * Y
CPU times: user 8.58 s, sys: 23.2 s, total: 31.8 s
Wall time: 37.7 s
In [8]: %time np.add(X, Y, out=X); np.add(X, Y, out=X)
CPU times: user 11.3 s, sys: 25.6 s, total: 36.9 s
Wall time: 42.6 s
No surprise, Julia gives nearly the same result:
julia> X = ones(Int, 1000000000);
julia> Y = ones(Int, 1000000000);
julia> @btime X .= X .+ 2Y
34.814 s (6 allocations: 7.45 GiB)
UPD. Just noticed the 7.45 GiB of allocations. We can get rid of them like this:
julia> @btime X .= X .+ 2 .* Y
20.464 s (4 allocations: 96 bytes)
or:
julia> @btime X .+= 2 .* Y
20.098 s (4 allocations: 96 bytes)
I may not have noticed the use of swap in the previous test, so I repeated it on a Linux box with 1e8 numbers (instead of 1e9). Julia took 100.583 ms while Python took 207 ms (probably due to reading the array twice). So I guess adding 1e9 numbers should take about 1 second on a modern desktop CPU.
I think the benchmark was probably done on a supercomputer. But that's really interesting how well Julia did. I did a basic logistic regression ML implementation in it years ago and I was impressed, but I stopped following its progress. Might have to keep it on my radar!
Even better would be to write (see dot vectorization [1]):
    C .= A .+ B
Benchmarks for 3 matrices of size 1000x1000:
julia> using BenchmarkTools
julia> @benchmark C = A + B
BenchmarkTools.Trial:
memory estimate: 7.63 MiB
allocs estimate: 2
--------------
minimum time: 2.359 ms (0.00% GC)
median time: 2.713 ms (0.00% GC)
mean time: 3.794 ms (28.81% GC)
maximum time: 62.708 ms (95.27% GC)
--------------
samples: 1314
evals/sample: 1
julia> @benchmark C .= A .+ B
BenchmarkTools.Trial:
memory estimate: 128 bytes
allocs estimate: 4
--------------
minimum time: 1.232 ms (0.00% GC)
median time: 1.320 ms (0.00% GC)
mean time: 1.356 ms (0.00% GC)
maximum time: 2.572 ms (0.00% GC)
--------------
samples: 3651
evals/sample: 1
Note that memory usage dropped from 7.63MiB to 128 bytes.
A little bit late to the party here, but the real allocation is just 0 bytes. It shows 128 bytes because the benchmark is creating new references to A, B and C. To correct this, either use interpolation with $A, $B and $C or initialize A, B and C in the setup phase:
> @benchmark C .= A .+ B setup = (A = rand(1000, 1000); B = rand(1000, 1000); C = rand(1000, 1000))
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 2.048 ms (0.00% GC)