mlochbaum's comments (Hacker News)

ngn/k is mentioned; a currently-maintained fork is at https://codeberg.org/growler/k.


mentioned twice in fact, thank you


Regarding graphics, CBQN is not well suited to doing these natively, as it runs on the CPU rather than the GPU and has only double-precision, not single-precision, floats. So, can you do some simple animation at 60FPS? Probably. Can you make a competitive game engine with it? Definitely not. A few programmers are making games using Brian's FFI raylib bindings (https://github.com/Brian-ED/rayed-bqn); see in particular David Zwitser's videos on Asteroids and Snake at https://www.youtube.com/@davidzwitser/videos. And we have nice FFI docs now: https://mlochbaum.github.io/BQN/doc/ffi.html!

(Everybody seems to assume "high performance" automatically means in whatever domain they work in. Impossible to interpret these sorts of questions. Try "Will BQN have high performance for ...?" instead.)


Thanks. Yes, I was thinking more about the (traditionally) CPU side of graphics or games: things like CPU skinning or skeletal animation.

I was wondering mostly about the FFI overhead. So the question should really be, is it suitable for real-time low-latency contexts? But I have your answer :)

I still would love to find a use for it one day. If there's ever a chance you could go the Futhark route and allow compilation of CPU/GPU routines, I think it would be an ideal kind of syntax.


There's also a GC pause issue: reference counts are the main method, but when those don't do the job we rely on stop-the-world mark and sweep. And we do very little analysis so it's easy for namespaces and closures to form reference loops. If you're careful about how you use those, it's possible to keep the reference counts working, but debugging when this fails is probably not such a nice experience. Running GC every frame might be acceptable if the stored state has few enough objects; it can be around 1ms if there's not much to do.
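A rough analogy in Python, which happens to use the same hybrid scheme (reference counting as the main method, with a backup cycle collector for loops the counts can't handle); this is an illustration of the general problem, not CBQN's implementation:

```python
import gc

class Node:
    def __init__(self):
        self.ref = None

# Reference counting frees acyclic data promptly (in CPython):
a = Node()
a = None  # count drops to zero; freed immediately, no pause

# But a reference loop keeps both counts above zero forever:
b, c = Node(), Node()
b.ref, c.ref = c, b  # cycle, like a closure capturing its namespace
b = c = None         # unreachable, yet counts are still 1

# Only a mark-and-sweep style pass can reclaim the cycle:
unreachable = gc.collect()  # returns the number of objects found
```

The analogue of "being careful about how you use namespaces and closures" is avoiding the `b.ref, c.ref = c, b` pattern so that plain counts stay sufficient.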


I have explored it, see https://mlochbaum.github.io/BQN/implementation/primitive/sor....

I implemented this method in Dyalog 18.0 with BlockQuicksort-like partitioning, using vectorized comparison with bit-boolean output. It's faster than you'd expect, maybe twice as slow as regular binary search when searched values are in cache, and better once they fall out of L2. But Dyalog reverted to 17.1 for later versions so you won't see it in a new download. It'll probably make it into CBQN eventually, perhaps with radix partitioning. Note that both quicksort and radix partitioning can be done and undone in a cache-friendly way.

Unlike quicksort, there's no issue of pivot selection since you always choose the midpoint of the searched values. However, there's a complementary issue of memory if the partitions become unbalanced, because the larger partition can require saved memory of roughly the depth times the number of elements. With a bit per comparison it's bounded by the size of the input.
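A scalar Python sketch of the idea as I understand it (the real implementations batch the comparison vectorized with bit-boolean output; names here are mine, not from Dyalog or CBQN). All queries are partitioned around the table's midpoint at once, so each level of the search touches only one table element, and the saved bits undo the partition at the end:

```python
def batch_search(table, queries, lo=0, hi=None):
    """For each query, return the count of elements of the sorted
    list `table` that are less than it (i.e. its insertion point)."""
    if hi is None:
        hi = len(table)
    if lo == hi:
        return [lo] * len(queries)
    mid = (lo + hi) // 2
    pivot = table[mid]
    # One batched comparison pass: a single bit saved per query,
    # BlockQuicksort-style, bounding the extra memory.
    bits = [q > pivot for q in queries]
    left  = [q for q, b in zip(queries, bits) if not b]
    right = [q for q, b in zip(queries, bits) if b]
    lres = iter(batch_search(table, left, lo, mid))
    rres = iter(batch_search(table, right, mid + 1, hi))
    # Undo the partition using the saved bits.
    return [next(rres) if b else next(lres) for b in bits]
```

This matches `bisect.bisect_left` on each query, but with the memory-access pattern described above.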


Wasn't the best section link, last paragraph here has more detail: https://mlochbaum.github.io/BQN/implementation/primitive/sor...


Nice overview of sorting methods, thanks for sharing!

I also looked a bit into radix and distribution sort at some point over the past year, but in the end high performance sorting is actually too big of a thing to just do quickly on the side, as your post well shows :") In fact I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.

Will definitely refer back to it once I'm looking at sorting again in more detail at some point!


I got stuck on sorting too; I was working on SingeliSort (https://github.com/mlochbaum/SingeliSort) for a while. The basic performance is there, but I need to get serious about testing before using it. The radix sort and counting sort should be very solid, though. The approach is about the same as the C code currently used in CBQN, linked below. The main complication is reducing constant overhead for shorter arrays, using a small count type and better prefix sums: interleaved SWAR in SingeliSort, since it targets generic architectures, and shared SIMD utilities in CBQN.

Email in my Github profile, feel free to contact me any time if you'd like to talk about algorithms!

32-bit radix sort: https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/grad... plus https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/radi...
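For readers unfamiliar with the structure of those files, here's a plain-Python sketch of an LSD radix sort built from stable counting-sort passes; it shows the shape of the algorithm (digit counts, then an exclusive prefix sum into offsets, then a scatter), not the small count types or SIMD prefix sums that make the C version fast:

```python
def counting_sort_byte(arr, shift):
    """One stable counting-sort pass over non-negative ints,
    keyed on the base-256 digit at bit position `shift`."""
    counts = [0] * 256
    for x in arr:
        counts[(x >> shift) & 0xFF] += 1
    # Exclusive prefix sum: counts become starting offsets.
    total = 0
    for d in range(256):
        counts[d], total = total, total + counts[d]
    out = [0] * len(arr)
    for x in arr:
        d = (x >> shift) & 0xFF
        out[counts[d]] = x
        counts[d] += 1
    return out

def radix_sort_u32(arr):
    """LSD radix sort: four byte passes, least significant first.
    Stability of each pass makes the composition correct."""
    for shift in (0, 8, 16, 24):
        arr = counting_sort_byte(arr, shift)
    return arr
```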


We implemented something like this in CBQN last year (mainly for modulus, as floor division isn't a primitive). Commit is https://github.com/dzaima/CBQN/commit/d333902, some proofs of when and why it works at https://mlochbaum.github.io/BQN/implementation/primitive/ari....
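For context on why floored modulus needs work on top of C (a general identity, not the specific optimization in that commit): C's `%` truncates toward zero, while the array-language primitive is floored, so the remainder's sign follows the divisor. A hypothetical sketch of the fixup:

```python
def trunc_rem(a, b):
    """C-style truncated remainder: sign follows the dividend."""
    r = abs(a) % abs(b)
    return -r if a < 0 else r

def floor_mod(a, b):
    """Floored modulus: sign follows the divisor, matching the
    array-language primitive (and Python's own %). Assumes b != 0."""
    r = trunc_rem(a, b)
    if r != 0 and (r < 0) != (b < 0):
        r += b  # shift into the divisor's sign range
    return r
```

So e.g. `floor_mod(-7, 3)` is 2 where C's `-7 % 3` is -1.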


My own take on relating scales geometrically: https://mlochbaum.github.io/BQN-Musician/theory/modulation.h...

It does seem that I include all Chapman's scales (while saying nothing about chords), although oddly enough he's chosen to use the modes of harmonic major but not those of its inversion, harmonic minor?

Edit: In fact I found the second link (first one's pretty vague and wasn't enough for me to follow the diagram) relevant enough that I added a paragraph to point out the connection!


Array programming seems adaptable to any form of parallel processing: SIMD instructions, multi-core, or GPUs. I think it's because, in addition to their inherent parallelism, primitives are so simple that they make very few demands on an implementation. GPU code can be a lot more flexible than array programming because the processors have a bit of independence, so array primitives are not necessarily the best way to use a GPU, but they're pretty good and the limitations might make it easier for a human to think about.


APL did catch on to some extent, see https://news.ycombinator.com/item?id=39471718.

Without getting into any discussion of the array paradigm itself, the reason commercial programming is such a winner-take-all system now is hiring and organizational difficulties, not that popular languages are so much more productive than unpopular ones. Building a working application is the easy part of running a business.


Memory traffic is certainly the biggest problem that the array paradigm presents for implementation, yes. I'd quibble with calling that "poor locality": when working with large arrays, if any part of a cache line is accessed it's very likely the entire line will be used in short order, which is the definition of good locality as I understand it. The issue is simply high memory use and a large number of accesses.

I think it's an oversimplification to say SPMD is a better model full stop. Array programming has the advantage that the implementer optimizes a specific function, and its interaction with the rest of the program is much simpler: all the data is available at the start, and the result can be produced in any order. So things like multi-pass algorithms and temporary lookup tables are possible, particularly valuable if the program isn't spending that much time on arithmetic but rather "heavy" searching and sorting operations. Not that I claim NumPy or CuPy do a good job of this. Sections "Fusion versus fission" and "Dynamic versus static" are relevant here, I think: https://mlochbaum.github.io/BQN/implementation/versusc.html#...
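A toy illustration of the "all data available up front" advantage, using membership search as the heavy operation (my own example, not from the linked page): a whole-array primitive can choose a multi-pass algorithm with a temporary lookup structure, where a per-element SPMD formulation would naturally repeat the search in every lane.

```python
def member_of(needles, haystack):
    """Whole-array 'member of' (like APL's needles ∊ haystack).
    Because the implementation sees all the data at once, it can
    spend pass 1 building a temporary table, then answer every
    query in O(1) during pass 2."""
    table = set(haystack)                  # pass 1: lookup table
    return [n in table for n in needles]   # pass 2: cheap probes
```

A per-element version would scan `haystack` once per needle; the array form does the expensive work once and amortizes it.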


It's in several, particularly newer APL dialects; see https://aplwiki.com/wiki/Under#History . Proud to say I originated the "structural" form used by Uiua, which is able to deal with transformations like filtering that lose parts of the input. Every language now seems to have its own take on what exactly Under means, with different implementations leading to different supported functions and various ways to relax the theory to be more convenient or handle intuitively expected things better.
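A minimal Python sketch of the filtering case (mask-based only; actual structural Under is defined for arbitrary structural transformations, and this simplification is mine): extract the selected elements, apply the function to them, and insert the results back into their original positions.

```python
def under_filter(f, mask, xs):
    """Apply f 'under' a boolean filter: f sees only the masked
    elements, and its results are written back in place. Requires
    f to return as many elements as it was given."""
    selected = [x for x, m in zip(xs, mask) if m]
    replaced = f(selected)
    assert len(replaced) == len(selected), "f must preserve length"
    it = iter(replaced)
    return [next(it) if m else x for x, m in zip(xs, mask)]

# e.g. sort only the masked elements, leaving the rest fixed:
# under_filter(sorted, [1, 0, 1, 0, 1], [5, 9, 1, 9, 3]) → [1, 9, 3, 9, 5]
```

The point of the "structural" formulation is that the filter loses part of the input, yet the inverse (writing results back) is still well defined from the structure of the transformation.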


Is this because all the code you see is through HN or similar? No one's going to share something titled "an unremarkable script I use to help run my business" here. Not sure what your threshold for code golf is, but you can see APL written in a variety of styles by searching Github. It doesn't recognize K but does have Q, which is basically K plus keywords, obviously promoting more verbose code. Whitney originated the dense style of implementation shown at your link, and a few other implementers (including myself in the past) have picked it up, but it's not that common. For example April, GNU APL, Kap, Goal, and Uiua all use an idiomatic style for their implementation languages.

APL: https://github.com/search?type=code&q=language%3AAPL

Q: https://github.com/search?type=code&q=language%3Aq

Implementation: https://aplwiki.com/wiki/List_of_open-source_array_languages

