More

mlochbaum · 2025-02-24T16:18:50 1740413930

The queue method is popular, but there's a much faster (branch-free) and in my opinion simpler way, known as the van Herk/Gil-Werman algorithm in image processing. It splits the input into windows and pairs up a backward scan on one window with a forward scan on the next. This works for any associative function. I was very surprised when I learned about it that it's not taught more often (the name's not doing it any favors)! And I wrote a tutorial page on it for my SIMD-oriented language, mostly about vectorizing it which I didn't quite finish writing up, but with what I think is a reasonable presentation in the first part: https://github.com/mlochbaum/Singeli/blob/master/doc/minfilt...

I also found an interesting streaming version here recently: https://signalsmith-audio.co.uk/writing/2022/constant-time-p...

EDIT: On closer inspection, this method is equivalent to the one I described, and not the one I'm used to seeing with queues (that starts my tutorial). The stack-reversing step is what forms a backwards scan. The combination of turning it sequential by taking in one element at a time but then expressing this in functional programming makes for a complicated presentation, I think.

mlochbaum · 2025-02-14T02:18:13 1739499493

So the idea is that you'll eventually arrange n elements into ideally about √n sorted lists, organized so that the endpoints of these lists are also ordered (the "rainbow" shape). A new element goes on the end of one of those lists or a new list; which one is generally decided by binary search, but the index is saved to speed up the next insertion if it's the same. So a natural run might benefit by having adjacenct values inserted to the same list.

Which cases benefit and which don't seems potentially quite erratic. If you have a series of runs that shrink over the course of the array, with increasing start value and decreasing start value, each will become a new sorted list and you end up with O(n) insertion plus a perfect merge. If the runs get wider instead, they'll cross over existing endpoints and end up spattered around before merging.

It feels like the paper keeps trying to spin constant-factor improvements as somehow being asymptotically relevant despite acknowledging that mathematically they aren't, for example the fact that log(√n) is only log(n)/2 "would not reflect the sublinear nature of base array length k" and runs hitting the best case "approximately 50% of the time... reinforces JesseSort's average runtime complexity as somewhere between O(n) and O(n log_2 n)". Powersort provably takes O(n + n log r) worst-case time where r is the number of runs; I expect this algorithm is O(n log n) worst- and average-case.

However, on random arrays the given benchmarks do improve relative to Python default as the length increases, with faster performance starting a little under a million elements. There isn't a lot of investigation into why this is. I'd be interested to know if the number of comparisons is reduced in particular. Perhaps this algorithm is more cache-friendly with a random input, but this wouldn't be reliable as many inputs can cause the number of sorted lists to scale with O(n) so that the binary searches need to work on a large array.

mlochbaum · 2025-02-14T03:02:25 1739502145

The analysis would make a lot more sense if it dropped every big O and computed an expected number of comparisons. Of course, the real issue is that the whole thing relies on an empirically-determined length of √8√n for the number of lists in the rainbow, when the theoretical justification isn't there and clearly the worst case is n/2. I'm not seeing any awareness that the expected number of comparisons has to be at least log_2(n!), and am starting to see some worrying signs that the author thinks averages like n log log n are possible. I'd initially assumed that "somewhere between O(n) and O(n log_2 n)" statement meant in the presence of asymptotically large natural runs, but there's a claim that the runs "need not be of any minimum size" and another that the algorithm can "capitalize on natural runs" that seems to be referring to the benchmarks on "purely random values".

mlochbaum · 2025-02-14T14:33:05 1739543585

Ah, the README prior to commit b98f724 does claim O(n log(1.5 ln(n))) asymptotic runtime. This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed, which the paper presents before describing the problem that each inserted element pushes one of those endpoints away from the middle. Bit of a perpetual motion machine search: initial results showed the promise of this approach, some details got in the way but surely there's some trick to work around them...

Still there's the benchmark. I remembered timsort.txt shows comparison counts relative to the log_2(n!) bound and it's only like 1% worse, so the >30% improvement over Timsort at the high end has to come from some other factor. I'm thinking it is actually cache effects—not because the algorithm has inherently better cache properties, as it's just a merge sort at large scales, but because it copies the array to ints internally rather than using a generic comparison. That would be a poor result, as good comparison sorts compiled for 4-byte data are many times faster than Timsort.

tim-peters · 2025-02-15T20:35:58 1739651758

I want to thank you for your analysis! I'm the "timsort" guy, and I'm asked to look at all sorts of things. The devil is in the details, and I've given up bothering to look unless a paper spells out sufficient details up front. Else it's a pit I'd rather not get sucked into.

As general things, timsort was aimed at exploiting all kinds of pre-existing order, not random lists. The only goal for the latter was to be within reach of CPython's previous highly tuned samplesort implementation (like quicksort on steroids, with much less data movement than mergesorts, but more comparisons).

As you say, for randomly ordered input it gets remarkably close to the information-theoretic lower bound on # of compares. So that's not it.

"Cache effects", maybe, but that's a pit to dig into. I'll note that while the newer "powersort" merge strategy is elegant and provably near-optimal by some relevant measures, in practice it doesn't really appear to run any faster (although cases can be _contrived_ that make it - or the older method! - run faster).

Comparison specialization (to ints) could very well account for it. CPython's general comparison machinery is _very_ expensive, even for what turn out to be native machine ints (which CPython cannot know at compile-time: everything is deduced at runtime).

CPython has since grown new machinery to do a pre-pass over the list, and do a form of cheaper comparison specialization for what turn out to be homogeneous lists of suitable types (including "small enough" ints). That alone gives enough speedup to make some of the original choices sub-optimal. For example, binary insertion sort for short runs is no longer best then. That was aimed at minimizing worst-case # of compares, but as comparisons get cheaper that has less value.

There are also surprises in everything. For example, the worst case for binary insertion sort on most machines was _not_ reverse-ordered input, despite that it requires the most data movement. Instead the worst case was randomly ordered data. Why? Branch prediction. In randomly ordered data, each branch is unpredictable. In reverse-ordered data, "move to the left" is always the test outcome.

Another generality is that Python's sort cares more about speed for shorter lists than huge ones. "Big data" problems are more likely to use, e.g., extensions geared to "big data problems", like numpy for giant arrays of floats. In those contexts more suitable _algorithms_ exist, like radix sorts, or multi-key quicksorts for lists of long strings with long shared prefixes.

Python's niche is more in, e.g., web services, where a great many sorts of shorter lists are common.

Bottom line: there is no one "best" sorting algorithm. I keep an eye out for newer developments, but haven't seen anything better yet for what Python is aiming at. I was, e.g., very impressed by pdqsort - but it's not a stable sort, and that alone makes it a non-starter for general Python use.

mlochbaum · 2025-02-16T00:02:41 1739664161

Whoa, hey Tim! Very much agree on no "best" sorting algorithm. The generic-comparison case is one I keep idly considering, although more focused on long lists than short ones. There's been a lot of improvement lately in sorting small datatypes (arguably kicked off by fluxsort), but really, who has so many 4-byte ints laying around that sorting them is a bottleneck? A point Orson Peters made when presenting glidesort[0] was that lists with many repeated duplicates are common even in generic data, as in ordering customers by city, and quicksort seems to be the best way to deal with such a list. However glidesort/driftsort is still largely oriented towards smaller types, and its mergesort-quicksort hybridization can pretty easily make bad choices, as I described at [1] and especially [2] (slightly surprised I didn't mention binary insertion sort in that first section). So this approach feels like it's still immature to me.

[0] https://archive.fosdem.org/2023/schedule/event/rust_glidesor...

[1] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

[2] https://mlochbaum.github.io/BQN/implementation/primitive/sor...

tim-peters · 2025-02-16T03:12:42 1739675562

You're much more up to date on the details of recent developments than I am. It's faded into a background interest (although a persistent one) for me.

One thing that wasn't clear to me about JesseSort: in what way(z) are Rainbows believed to be an improvement over the simpler scheme used by the older "patience sorting"? They both maintain a collection of sorted sublists merged at the end, both use binary search to find the right sublist to add "the next" array element to, and both excel at "low diversity" inputs (if there are K distinct values, patience sorting will build at most K sublists).

As I recall, full-blown patience sorting isn't "naturally stable". Searching through the JesseSort paper, I didn't find it addressed, and the details were too fuzzy to guess offhand. While this is in no way a principled objection, it just _seemed_ "too complicated" to me. Then again, my intuition for such things has served me well before ;-)

mlochbaum · 2025-02-16T04:04:09 1739678649

I'd forgotten about patience sorting! Not that I was ever exactly familiar with it. JesseSort seems nearly identical, except that it allows insertion to either side of the lists, tries the previous index before doing a binary search, and uses pairwise instead of k-way merge at the end. The expected Θ(√n) list count is the same as well. An obvious advantage of going double-ended is that you get automatic O(n) sorting if the input's an ascending or descending run, instead of just one of those.

I think both algorithms can be made stable; if there's a barrier for patience sort it would be that the merge phase pops off the end of the sorted lists so it's going in the opposite direction as insertion? For JesseSort, an element that's equal to a previous one can only be inserted to the same list, or a more central one (created later). So you need to merge sublists giving priority to the outer ones, and also never append an element to the bottom of a sublist if it's equal to that list's current minimum. Doesn't look like JesseSort as implemented in jessesort_c.pyx does either of these: it merges non-adjacent lists and has a backwards merge by virtue of using < instead of <=, and always inserts to the innermost list possible instead of being strict about the bottom side. Which is understandable as being strict has a major downside. If you get a string of equal values in the bottom half of the range, you'd be stuck putting each one in a new sublist until you get to the midpoint and can create a new one where these values start going to the top.

tim-peters · 2025-02-16T06:03:03 1739685783

Ya, I'm just too old for bottomless pits anymore ;-) Patience sorting is still worth study, for its elegance, simple code, and real-world practicality in the context of solving the longest increasing subsequence problem. As a full-blown sorting algorithm, it only excels at reverse-ordered and "low diversity" inputs, but that's in return for very simple, uniform, and easily understood code.

Of course any of these could be changed to use the powersort merge strategy, if desired. It's elegant and principled. But it also exploits that timsort and powersort keep the _number_ of runx simultaneously active no more than log2(N). Needing to keep track of potentially O(N) sublists simultaneously is potentially burdensome.

In any case, I'm most interested in adaptive sorting, since partial order so very often exists in the real world (as you noted that Orson Peters noted, this is often due to "low diversity" as a consequence of sorting on one field of a complex record - but it's more than just that). Thanks to your good explanations, I'm persuaded that JesseSort isn't much more promising than patience sorting in that specific respect (yes, looks like JesseSort can handle ascending order in linear time too). So thanks again!

zahlman · 2025-02-14T21:17:38 1739567858

>This is based on a flawed analysis assuming rainbow endpoints will be evenly distributed

I guess you know this, but we don't need to do any complicated analysis to disprove such a big-O for a comparison-based sort. It's a well known information-theory result: each comparison can remove at most O(1) entropy, while the initial state has O(n lg n) entropy because that's how log(N!) grows (https://en.wikipedia.org/wiki/Stirling%27s_approximation).

mlochbaum · 2025-02-10T23:12:07 1739229127

Seems easy enough to use a parallel scan if you're willing to accept a little work inefficiency, right? Assign each scanner thread a block, first each one counts/xors how many quotes are in its block, exclusive scan on those (last thread's result is unused), and you have the quoting state at the start of each block. And hopefully that block's still in the core's cache.

Or since newlines in strings should be rare, maybe it works to save the index of every newline and tag it with the parity of preceding quotes in the block. Then you get the true parity once each thread's finished its block and filter with that, which is faster than going back over the block unless there were tons of newlines.

orlp · 2025-02-10T23:50:19 1739231419

Yes, I did already propose (at the office) a parity-agnostic chunker (we only need the number of lines + a splitpoint from the chunker) that can do parallel work and only needs a small moment of synchronization to find out which of the two parities it is to lock in a final answer. There would still be a global serial dependency, but on blocks rather than on bytes.

But we only have a finite amount of time and tons and tons of work, so no one has gotten around to it yet. At least now we know that it might be worthwhile for >= ~32 core machines. PRs welcome :)

mlochbaum · 2025-02-11T00:06:39 1739232399

All right, just threw me off a little that you'd consider speculating or backwards decoding as I wouldn't expect them to be easier, or significantly faster (or maybe you consider parity-independence to be speculation? I can see it).

orlp · 2025-02-11T01:00:55 1739235655

Yes, I meant parity-independence with speculation. Essentially you assume either you are or are not within a string at the start and do your computation based on that assumption, then throw away the result with the unsound assumption. Both assumptions can share most of their computation I believe, so I can understand one might see it from the other perspective where you'd start with calling it parity-independence rather than speculation with shared computation.

dzaima · 2025-02-11T01:04:50 1739235890

There might also be the option of just optimistically assuming that, for points in a file with a sequence of like >4K bytes of proper newlines with proper comma counts in each, that here probably isn't in the middle of a multiline string, and parsing it as such (of course with proper fallback if this turns out false; but you'll at least know that this whole run is in the middle of a multiline string).

Also, if you encounter a double-quote character anywhere with a comma on one side and neither a newline, double-quote nor comma on the other, you immediately know 100% whether it starts or ends a string.

mlochbaum · 2025-02-10T19:36:56 1739216216

I think it's worth considering that application development and GUIs really aren't K's thing. For those, yes, you want to be pretty careful about the concept of a "character", but (as I understand it) in K you're more interested in analyzing numerical data, and string handling is just for slapping together a display or parsing user commands or something. So a method that lets the programmer use array operations that they already have to know instead of learning three different non-array ways to work with strings is practical. Remember potential users are already very concerned about the difficulty of learning the language!

mlochbaum · 2025-02-10T14:46:14 1739198774

You win! Whitney was just out of graduate school at the time, and had worked some with APL at I.P. Sharp but was implementing "object-oriented languages, a lot of different LISPs, Prolog"[0]. Next was the more APL-like A around 1985 and K only in 1992.

[0] https://queue.acm.org/detail.cfm?id=1531242

mlochbaum · 2025-01-27T13:47:23 1737985643

Several comments seem confused about this point: the article is not about manual SIMD, which of course C is perfectly capable of with intrinsics. It's discussing problems in compiling architecture-independent code to SIMD instructions, which in practice C compilers very often fail to do (so, exp having scalar hardware support may force other arithmetic not to be vectorized). An alternative mentioned is array programming, where operations are run one at a time on all the data; these languages serve as a proof of concept that useful programs can be run in a way that uses SIMD nearly all the time it's applicable, but at the cost of having to write every intermediate result to memory. So the hope is that "more fluent methods for compilation" can generate SIMD code without losing the advantages of scalar compilation.

As an array implementer I've thought about the issue a lot and have been meaning to write a full page on it. For now I have some comments at https://mlochbaum.github.io/BQN/implementation/versusc.html#... and the last paragraph of https://mlochbaum.github.io/BQN/implementation/compile/intro....

mlochbaum · 2025-01-27T17:43:46 1737999826

Finding it increasingly funny how many people have come out of the woodworks to "defend" C by offering this or that method of explicit vectorization. What an indictment! C programmers can no longer even conceive of a language that works from a description of what should be done and not which instructions should be used!

mlochbaum · 2025-01-27T15:35:46 1737992146

Got the introduction written at https://mlochbaum.github.io/BQN/implementation/compile/fusio....

brundolf · 2025-01-27T16:24:11 1737995051

Thank you. The title is confusing

pjmlp · 2025-01-28T12:02:54 1738065774

Compiler specific C is capable of, ISO C has no SIMD support of any form.

mlochbaum · 2025-01-27T01:52:12 1737942732

https://beyondloom.com/decker/lil.html

See also https://beyondloom.com/blog/rankingoffruits.html for a nice introduction to the query syntax and its relation to K. (And RodgerTheGreat is John Earnest, the author).

mlochbaum · 2025-01-26T23:10:14 1737933014

ngn/k is mentioned, currently-maintained fork at https://codeberg.org/growler/k.

g939763 · 2025-01-26T23:25:08 1737933908

mentioned twice in fact, thank you

mlochbaum · 2025-01-25T16:14:58 1737821698

Regarding graphics, CBQN is not very suited to doing these natively as it runs on the CPU rather than GPU and has only double-precision and not single-precision floats. So, can you do some simple animation at 60FPS? Probably. Can you make a competitive game engine with it? Definitely not. A few programmers are doing games using Brian's FFI raylib bindings (https://github.com/Brian-ED/rayed-bqn), see in particular David Zwitser's videos on Astroids and Snake at https://www.youtube.com/@davidzwitser/videos. And we have nice FFI docs now: https://mlochbaum.github.io/BQN/doc/ffi.html!

(Everybody seems to assume "high performance" automatically means in whatever domain they work in. Impossible to interpret these sorts of questions. Try "Will BQN have high performance for ...?" instead.)

truckerbill · 2025-01-26T10:26:32 1737887192

Thanks. Yes I was thinking more about the (traditionally) CPU side of graphics or games, things like CPU skinning or skeletal animation.

I was wondering mostly about the FFI overhead. So the question should really be, is it suitable for real-time low-latency contexts? But I have your answer :)

I still would love to find a use for it one day. If there's ever a chance you could go the futhark route and allow for compilation of CPU/GPU routines I think it would be an ideal kind of syntax.

mlochbaum · 2025-01-26T13:10:13 1737897013

There's also a GC pause issue: reference counts are the main method, but when those don't do the job we rely on stop-the-world mark and sweep. And we do very little analysis so it's easy for namespaces and closures to form reference loops. If you're careful about how you use those, it's possible to keep the reference counts working, but debugging when this fails is probably not such a nice experience. Running GC every frame might be acceptable if the stored state has few enough objects; it can be around 1ms if there's not much to do.

mlochbaum · 2025-01-01T13:41:51 1735738911

I have explored it, see https://mlochbaum.github.io/BQN/implementation/primitive/sor....

I implemented this method in Dyalog 18.0 with BlockQuicksort-like partitioning, using vectorized comparison with bit-boolean output. It's faster than you'd expect, maybe twice as slow as regular binary search when searched values are in cache, and better once they fall out of L2. But Dyalog reverted to 17.1 for later versions so you won't see it in a new download. It'll probably make it into CBQN eventually, perhaps with radix partitioning. Note that both quicksort and radix partitioning can be done and undone in a cache-friendly way.

Unlike quicksort, there's no issue of pivot selection since you always choose the midpoint of the searched values. However, there's a complementary issue of memory if the partitions become unbalanced, because the larger partition can require saved memory of roughly the depth times the number of elements. With a bit per comparison it's bounded by the size of the input.

mlochbaum · 2025-01-01T17:19:31 1735751971

Wasn't the best section link, last paragraph here has more detail: https://mlochbaum.github.io/BQN/implementation/primitive/sor...

curiouscoding · 2025-01-03T01:07:34 1735866454

Nice overview of sorting methods, thanks for sharing!

I also looked a bit into radix and distribution sort at some point over the past year, but in the end high performance sorting is actually too big of a thing to just do quickly on the side, as your post well shows :") In fact I wasn't aware of the associativity issues for radix sort. That's definitely something to keep in mind and investigate.

Will definitely refer back to it once I'm looking at sorting again in more detail at some point!

mlochbaum · 2025-01-06T01:15:22 1736126122

I got stuck on sorting too, was working on SingeliSort (https://github.com/mlochbaum/SingeliSort) for a while. The basic performance is there but I need to get serious about testing before using it. But the radix sort and counting sort should be very solid. The approach is about the same as the C code currently used in CBQN, linked below. The main complication is to reduce constant overhead for shorter arrays with a small count type and better prefix sums, interleaved SWAR for SingeliSort since it targets generic architecture and shared SIMD utilities in CBQN.

Email in my Github profile, feel free to contact me any time if you'd like to talk about algorithms!

32-bit radix sort: https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/grad... plus https://github.com/dzaima/CBQN/blob/v0.8.0/src/builtins/radi...