Hacker News | mkmccjr's comments

Thank you for this! On the second point, you are absolutely correct; that was sloppy writing on my part. I will correct that in the post.

I'm not certain I understand your first point. When I add the type hint, it's me asserting the type, not the compiler proving anything. If the value at runtime isn't actually a byte array, I would expect a ClassCastException.

But I am new to Clojure, and I may well be mistaken about what the compiler is doing.


I mean, I think that probably is what happens. But even so, while it's sped up a lot, the generated bytecode presumably still includes instructions to attempt the cast and raise that exception when it fails.

(And the possible "JIT hotpath optimization" could be something like, at the bottom level, branch-predicting that cast.)


Aha, I think I better understand your point: since the generated bytecode includes a cast, my explanation about the optimization is too simplistic.

I haven't actually inspected the emitted bytecode, so I was only reasoning from the observed speedup.

Your point about branch prediction is really interesting; it would explain how the cast becomes almost free once the type is stable in the hot path.

I'm learning a lot from this thread -- thank you for pushing on the details!


Without seeing the actual differences in the bytecode it will be hard to tell what’s really going on. From my experience with other JITs, I’d expect the situation to be something like:

A) Without the typecast, the compiler can’t prove anything about the type, so it has to assume a fully general type. This creates a very “hard” bytecode sequence in the middle of the hotpath which can’t be inlined or optimised.

B) With the typecast, the compiler can assume the type, and thus only needs to emit type guards (as suggested in this thread). However, I’d expect those guards to be hoisted as far out of the function as possible; ideally the JIT lifts them all the way out of the loop, so they’re checked once rather than on every iteration. That enables a much shorter sequence for getting the array length each time around the loop, and avoids type/class checks on every pass.

This would avoid pressuring the branch predictor.

Most JITs have thresholds for “effort” that depend on environment and on how hot a path is measured to be at runtime. The hotter the path, the more effort the JIT will apply to optimising it (usually also expanding the scope of what it tries to optimise). But again, without seeing the assembly code (not just bytecode) of what the three different scenarios produce (unoptimised, optimised-in-test, optimised-in-prod) it would be hard to truly know what’s going on.

At best we can just speculate from experience of what these kinds of compilers do.


It's speculation because the author didn't show the bytecode, or even just what the code decompiles to in Java.

But even with speculation, it shouldn't be that surprising that dynamic dispatch and reflection [0] are quite expensive compared to a cast and a field access of the length property.

[0] https://bugs.openjdk.org/browse/JDK-8051447


These are both great points. When I wrote the post, it didn't occur to me that I could inspect the emitted bytecode. In hindsight, including that would have made the explanation much stronger.

To be honest, this is my first time really digging into performance on a JIT runtime. I learned to code as an astronomy researcher and the training I received from my mentors was "write Python when possible, and C or Fortran when it needs to be fast." Therefore I spent a lot of time writing C, and I didn't appreciate how aggressively something like HotSpot can optimize.

(I don't mean that as a dig against Python; it's simply the mental model I absorbed.)

The realization that I can have really good performance in a high-level language like Clojure is revolutionary for me.

I'm learning a ton from the comments here. Thanks to everyone sharing their knowledge -- it's genuinely appreciated.


> The realization that I can have really good performance in a high-level language like Clojure is revolutionary for me.

I should try it out some time. The Lisp family takes a bit of a mental reset to work with, but I've done it before.

> ...the training I received from my mentors was "write Python when possible, and C or Fortran when it needs to be fast."... (I don't mean that as a dig against Python; it's simply the mental model I absorbed.)

Well, you know, I've been using Python for over 20 years and that really isn't a "dig" at all. Python is famously hard to optimize, even compared to other languages you might expect to be comparable. (Seriously, the current performance of JavaScript engines seems almost magical to me.) PyPy is the "JIT runtime" option there, and you can easily create micro-benchmarks where it beats the pants off the reference implementation, CPython (which is written in C with fairly straightforward techniques). But the average improvement is... well, still pretty good ("On average, PyPy is about 3 times faster than CPython 3.11. We currently support python 3.11 and 2.7"), but shrinking over time, and it's definitely not going to put you in the performance realm of natively compiled languages.
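For a concrete sense of the kind of code where the gap shows up, here's a minimal pure-Python microbenchmark (names and sizes are illustrative, not from any official PyPy benchmark):

```python
import timeit

def dot(xs, ys):
    """Tight numeric loop: the kind of code a tracing JIT speeds up most."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total

xs = [float(i) for i in range(10_000)]
ys = [float(i) for i in range(10_000)]

# On CPython every iteration pays full interpreter and boxing overhead;
# under PyPy the same loop gets traced and compiled to machine code.
elapsed = timeit.timeit(lambda: dot(xs, ys), number=100)
print(f"dot = {dot(xs, ys):.1f}; 100 runs took {elapsed:.3f}s")
```

Running the identical file under `python` and `pypy` is an easy way to see the difference for yourself.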

The problem is there's really just too much that can be changed at runtime. If you look at the differences between Python and its competitors like Mojo, and at the restricted subsets and variants of Python used for things like Shedskin and Cython (and RPython, used internally by PyPy), you quickly get a sense of it.
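A toy illustration of that runtime mutability (my own example, not from the PyPy docs): any call site can change meaning mid-program, so the interpreter can rarely assume a target stays fixed.

```python
class Greeter:
    def hello(self):
        return "hello"

g = Greeter()
before = g.hello()

# Rebinding a method at runtime is perfectly legal Python; a compiler
# that assumed Greeter.hello was fixed would now be wrong.
Greeter.hello = lambda self: "bonjour"
after = g.hello()

print(before, after)  # the same call site resolves to different code
```

Restricted dialects like RPython get their speed largely by forbidding exactly this kind of thing.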


Author of the blog post here. That explanation sounds very plausible to me!

If the whole enclosing function became inlinable after the reflective call path disappeared, that could explain why the end-to-end speedup under load was even larger than the isolated microbench.

I admit that I don't understand the JIT optimization deeply enough to say that confidently... as I mentioned in the blog post, I was quite flummoxed by the results. I’d genuinely love to learn more.


I appreciate the optimism! This specific example is just a pedagogical toy designed to be simple enough to analyze fully.

That said, I do agree with the intuition that static networks have a ceiling. If we want systems that can truly adapt to new contexts (like different hospitals or different physical laws) without retraining, we likely need dynamic architectures.


You are absolutely right about the code: I haven't worked with neural networks in a while and I guess my post outs me!

That said, I do like Keras's functional API, and in this case I think it maps nicely to the "math" of the hypernetwork.
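For readers who haven't seen the post, here's a dependency-free sketch of the shape of that math: a hypernetwork h maps a context embedding z to the parameters of a target model, which is then applied to x. All the numbers and names here are illustrative (the post itself uses Keras):

```python
def hypernetwork(z):
    """Map a context embedding z to (weight, bias) of a 1-D linear target net.

    The hypernetwork here is a fixed linear map, purely to show the
    structure: theta = h(z), then y = f(x; theta).
    """
    weight = 2.0 * z[0] + 0.5 * z[1]
    bias = z[0] - z[1]
    return weight, bias

def target_model(x, theta):
    weight, bias = theta
    return weight * x + bias

# Two different contexts yield two different generated models.
theta_a = hypernetwork([1.0, 0.0])   # -> weight 2.0, bias 1.0
theta_b = hypernetwork([0.0, 1.0])   # -> weight 0.5, bias -1.0
print(target_model(3.0, theta_a), target_model(3.0, theta_b))
```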

I really appreciate your suggestion of more popular libraries, and I'll look into JAX.


Thank you for your comment, and I sincerely apologize for my slow response! "Rediscovering structure" is exactly the inefficiency I was trying to highlight.

In the physics/science cases I work with, the factorization is usually between the physical law (shared structure) and the experimental conditions (dataset-specific structure). If you don't separate them, the model wastes capacity trying to memorize the noise of the experimental conditions. (It's ineffective as well as wasteful.)

The analogy to code generation makes a lot of sense: flattening a tree into a sequence forces the model to infer syntax that was already explicit. Thank you for the link; I look forward to diving into it!


Thank you for reading my post, and for your thoughtful critique. And I sincerely apologize for my slow response! You are right that there are other ways to inject latent structure, and FiLM is a great example.

I admit the "static embedding" baseline is a bit of a strawman, but I used it to illustrate the specific failure mode of models that can't adapt at inference time.

I then used the Hypernetwork specifically to demonstrate a "dataset-adaptive" architecture as a stepping stone toward the next post in the series. My goal was to show how even a flexible parameter-generating model eventually hits a wall with out-of-sample stability; this sets the stage for the Bayesian Hierarchical approach I cover later on.

I wasn't familiar with the FiLM literature before your comment, but looking at it now, the connection is spot on. Functionally, it seems similar to what I did here: conditioning the network on an external variable. In my case, I wanted to explicitly model the mapping E->θ to see if the network could learn the underlying physics (Planck's law) purely from data.
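For anyone else landing on FiLM from this thread: the core operation is just a feature-wise affine transform whose scale and shift are generated from the conditioning variable. A minimal sketch (toy generator functions of my own, not FiLM's learned networks):

```python
def film_params(z):
    """Generate per-feature scale (gamma) and shift (beta) from condition z."""
    gamma = [1.0 + z, 1.0 - z]   # toy closed-form generators; FiLM learns these
    beta = [z, -z]
    return gamma, beta

def film_layer(features, z):
    """Feature-wise linear modulation: out_i = gamma_i * f_i + beta_i."""
    gamma, beta = film_params(z)
    return [g * f + b for g, f, b in zip(gamma, features, beta)]

# The same features are modulated differently under different conditions.
print(film_layer([2.0, 2.0], 0.5))   # -> [3.5, 0.5]
print(film_layer([2.0, 2.0], 0.0))   # -> [2.0, 2.0]
```

The contrast with a hypernetwork is that FiLM only generates the modulation parameters, not the full weight tensors.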

As for stability, you are right that Hypernetworks can be tricky in high dimensions, but for this low-dimensional scalar problem (4D embedding), I found it converged reliably.


Hello. I am the author of the post. The goal was to provide a pedagogical example of applying Bayesian hierarchical modeling principles to real-world datasets. These datasets often contain inherent structure that is important to model explicitly (e.g., clinical trials run across multiple hospitals). Often a single pooled model cannot capture the over-dispersion, but there is not enough data to fit each group separately (nor should you).

The idea behind hypernetworks is that they enable Gelman-style partial pooling: explicitly modeling the data-generating process while leveraging the flexibility of neural-network tooling. I’m curious to read more about your recommendations: their connection to the problems described is not immediately obvious to me, but I would like to dig a bit deeper.
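As a rough illustration of the partial-pooling idea (standard normal-normal shrinkage with known variances, not the post's actual model):

```python
def partial_pool(group_means, group_sizes, sigma2, tau2):
    """Shrink each group mean toward the grand mean.

    sigma2: within-group variance; tau2: between-group variance.
    Small groups are pulled strongly toward the grand mean; large
    groups mostly keep their own estimate.
    """
    grand = sum(m * n for m, n in zip(group_means, group_sizes)) / sum(group_sizes)
    pooled = []
    for m, n in zip(group_means, group_sizes):
        w = tau2 / (tau2 + sigma2 / n)   # shrinkage weight in [0, 1)
        pooled.append(w * m + (1 - w) * grand)
    return pooled

# A tiny hospital (n=2) is shrunk far more than a large one (n=200).
print(partial_pool([10.0, 2.0], [2, 200], sigma2=4.0, tau2=1.0))
```

No pooling and complete pooling fall out as the tau2 -> infinity and tau2 -> 0 limits, which is exactly the spectrum Gelman's framing emphasizes.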

I agree that hypernetworks have some challenges associated with them due to the fragility of maximum likelihood estimates. In the follow-up post, I dug into how explicit Bayesian sampling addresses these issues.


Original author here -- thank you for your thoughtful comment.

You're absolutely right that saying "SQL is useful" isn't exactly novel. My goal with the blog post was to describe the practical impact of leaning into SQL (and DuckDB) at our company.

I'm not the SQL expert on our team (that's my colleague Kian) but I've seen the difference he's made with his expertise. A lot of the work we migrated into SQL was originally implemented as the kind of multi-step pipelines you described: we used multiple libraries, wrote intermediate files, and had to translate data between different formats.

Kian recently rewrote a large stage of our pipeline so it runs entirely inside a single SQL script. It's a complicated script to be sure, but that's because the logic it implements is complex. And with CTEs, temp tables, and DuckDB's higher-order functions, it ended up being dramatically clearer than the original sprawl of code. More importantly, it's self-contained, and easy to inspect. Consolidating the logic into one place made a big difference for us.
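To illustrate the consolidation pattern in miniature (using stdlib sqlite3 and made-up table names rather than our DuckDB script; DuckDB layers extras like higher-order functions on top of the same idea):

```python
import sqlite3

# What used to be a multi-step pipeline (intermediate files, format
# translation between libraries) can collapse into one chain of CTEs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, region TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'east', 10.0), (2, 'east', 30.0),
                              (3, 'west', 5.0);
""")

rows = conn.execute("""
    WITH totals AS (                       -- stage 1: aggregate
        SELECT region, SUM(amount) AS total
        FROM orders GROUP BY region
    ),
    ranked AS (                            -- stage 2: derive from stage 1
        SELECT region, total,
               RANK() OVER (ORDER BY total DESC) AS rk
        FROM totals
    )
    SELECT region, total FROM ranked WHERE rk = 1
""").fetchall()

print(rows)  # the whole "pipeline" is one inspectable statement
```

Each CTE plays the role an intermediate file used to, but everything stays in one self-contained, queryable script.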

And thank you for catching my error about the CPU type. We recently moved from M2 Ultra servers to M4 machines, and I mistakenly conflated the two when I wrote "M4 Ultra." I've corrected the post.


Just tried this out, and my mind is blown: https://platform.sturdystatistics.com/deepdive?fast=0&q=camp...

I did a Google search for "camping with dogs" and it organized the results into a set of about 30 that span everything I'd want to know on the topic: from safety and policies to products and travel logistics.

Does this work on any type of data?


Awesome, so glad the results were helpful! What's cool is that because it's built on hierarchical Bayesian sampling, it's extremely robust to any input; it just kinda works.

