This is of course not a new invention. The earliest instance I could find with a bit of searching was from 2004, with Andrew Morton mentioning it in a code review so casually that it seems to have been a well established trick. But the vast majority of implementations I looked at do not do this.
I was doing this in 1992 so it's at least 12 years older than the 2004 implementation. I suspect it was being done long before that. Back then the read and write indexes were being updated by separate processors (even more fun, processors with different endianness) with no locking. The only assumption being made was that updates to the read/write pointers were atomic (in this case 'atomic' meant that the two bytes making up a 16-bit counter were written atomically). Comically, on one piece of hardware this was not the case, and I spent many hours inside the old Apollo works outside Boston with an ICE and a bunch of logic analyzers figuring out what the hell was happening on some weird EISA bus add-on to some HP workstation.
It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.
Edit: having had this discussion I've realized that Juho's implementation is different from the 1992 implementation I was using because he doesn't ever reset the read/write indexes. Oops.
> It's unclear to me why the focus on a 2^n sized buffer just so you can use & for the mask.
The cost of a mask can probably be entirely buried in the instruction pipeline, so that it's hardly any more expensive than whatever it costs just to move from one register to another.
Modulo requires division. Division requires a hardware algorithm that iterates, consuming multiple cycles (pipeline stall).
You do not need modulo or division to implement non-power-of-2 ring buffers, because you only ever increment by one. So instead of "x = x % BufferSize" you can do "if (x >= BufferSize) x -= BufferSize;" or similar.
That's for "normal" ring buffers. I suspect that the design described in the article can be implemented for non power-of-two without division but I'll need to think about the details.
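A minimal sketch of that branchy increment, for any capacity (the function name here is illustrative):

```c
#include <stddef.h>

/* Advance a ring index by one without modulo or division.
   Works for any capacity, not just powers of two; the branch
   is taken only once per trip around the buffer. */
static inline size_t ring_advance(size_t i, size_t capacity)
{
    i += 1;
    if (i >= capacity)
        i -= capacity;
    return i;
}
```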
You could also do a predicated conditional move instead. Just do the subtraction every time, and use something like cmov to only do the write if you need to.
I don't know if it would end up being faster, though.
Most likely (very very likely) the branch would be faster. It will almost always be predicted correctly (exceptions on the rollover) and cmov can be moderately expensive.
The general rule is to only use cmov if your test condition is mostly random.
I don't think this is true. I tried the sample out, and gcc, clang, and intel's compiler all generate a cmov for the code instead of a branch with -O2. I don't think all these compilers would have used a cmov instead of a branch if the cmov was more expensive than a branch in this case.
It might have something to do with the branch not being part of a loop so the best it can do is assume the branch is random (think something like modding a hash code when it would indeed be random).
As was pointed out in the thread, recent CPUs have reduced the latency of cmov to a cycle. So your result could also depend on your architecture.
It does, but it may run out of non-dependent instructions to execute while waiting for the long pole of the cmov dependency chain to finish.
This used to be a problem on Pentium4 where cmov had high latency (4 cycles or more), but today, IIRC, a register-to-register cmov is only one cycle, so it is safe to use whenever a branch could have a non-trivial misprediction rate.
Then you have to deal with branch mispredictions which may hurt performance pretty bad if the RB is heavily trafficked (which often is the use case for an RB).
Actually it might not really be mispredicted. Slightly older Intel CPUs had a dedicated loop predictor that would exactly predict quite complicated taken/non-taken sequences. If the RB is of fixed size the edge case would be always perfectly predicted.
More recent CPUs, IIRC, do away with the dedicated loop predictor as they have a much more sophisticated general predictor, which, although won't guarantee perfect prediction on this case, it might still get close enough.
Depends heavily on the size of the buffer. If it's only three elements large, then branch overhead will be measurable. But for larger buffers, it likely will always be predicted as not taken, and you only have a branch miss upon wraparound.
Modulo by a constant doesn't require a division; you can instead use multiplications, shifts, adds and subtracts. This transform is typical for compilers. For example, gcc targeting x86_64 compiles % 17 on the unsigned value in %edi into just such a multiply-and-shift sequence.
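As an illustration of the same reciprocal transform in C (a hedged sketch, not gcc's literal output; the magic constant is ceil(2^36 / 17), which is exact for all 32-bit unsigned inputs):

```c
#include <stdint.h>

/* Unsigned x % 17 without division:
   q = floor(x * ceil(2^36 / 17) / 2^36), then r = x - q * 17.
   0xF0F0F0F1 == ceil(2^36 / 17). Compilers do this kind of
   strength reduction automatically for '% constant'. */
static inline uint32_t mod17(uint32_t x)
{
    uint32_t q = (uint32_t)(((uint64_t)x * 0xF0F0F0F1u) >> 36);
    return x - q * 17u;
}
```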
That only works for compile time constants, though. With a power of two sized buffer you can just store the mask and decide how large you want your buffer to be at runtime.
The real problem with modulo is at the end of the integer range. Add one and overflow and suddenly jump to a totally different index in the array!
BTW: read and write pointers, power of two, did that in BeOS (1998) and many sound drivers did it earlier than that.
To me, that seemed like the obvious way to do it when I needed it.
One of the benefits of the original algorithm is the independence of the read and write indexes, they can be updated from different threads (or different processors!) without any atomic operations beyond writing or reading a value. Subtracting from both pointers requires an additional atomic read/modify/write operation.
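A hedged sketch of what that single-producer/single-consumer independence can look like with C11 atomics (names and size are made up; each index is written by exactly one side, and acquire/release ordering stands in for the barriers the old hardware did without):

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RB_SIZE 256u            /* power of two */
#define RB_MASK (RB_SIZE - 1u)

struct spsc {
    _Atomic uint32_t read;      /* written only by the consumer */
    _Atomic uint32_t write;     /* written only by the producer */
    int data[RB_SIZE];
};

/* Producer side: returns false when the buffer is full. */
static bool spsc_push(struct spsc *q, int v)
{
    uint32_t w = atomic_load_explicit(&q->write, memory_order_relaxed);
    uint32_t r = atomic_load_explicit(&q->read, memory_order_acquire);
    if (w - r == RB_SIZE)       /* free-running indexes; unsigned wrap is fine */
        return false;
    q->data[w & RB_MASK] = v;
    atomic_store_explicit(&q->write, w + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the buffer is empty. */
static bool spsc_pop(struct spsc *q, int *out)
{
    uint32_t r = atomic_load_explicit(&q->read, memory_order_relaxed);
    uint32_t w = atomic_load_explicit(&q->write, memory_order_acquire);
    if (w == r)
        return false;
    *out = q->data[r & RB_MASK];
    atomic_store_explicit(&q->read, r + 1, memory_order_release);
    return true;
}
```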
You could also just restrict the pointers in the normal way but to two times the size of the buffer. So instead of wrapping at N you wrap at 2*N.
You are only encoding 1 bit of data (first or second) so adding more data than that by allowing unsigned integer overflow is just an optimization, not fundamentally necessary.
If you do that then the size() function becomes a problem. The original implementation relies on unsigned integer wrap-around to give the proper result when write < read.
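One way to keep size() correct when wrapping the indices at 2*N instead of relying on unsigned overflow (a sketch; N here is an arbitrary capacity, not necessarily a power of two):

```c
#include <stdint.h>

#define N 5u   /* capacity; any value */

/* Indices live in [0, 2N) and wrap at 2N, not at the
   integer's natural overflow point. */
static inline uint32_t rb_inc(uint32_t i)
{
    i += 1;
    if (i == 2u * N)
        i = 0;
    return i;
}

static inline uint32_t rb_size(uint32_t read, uint32_t write)
{
    /* write may have wrapped at 2N and sit "behind" read. */
    return write >= read ? write - read : write + 2u * N - read;
}
```

Full is size N, empty is size 0, and no unsigned-overflow reasoning is needed.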
It all depends on your use case. In my experience, most of the times when dealing with RB:s, it's more important to guarantee that the read index and the write index can be updated lock-free by different threads (e.g. one producer and one consumer) than to have a very specific non-power-of-2 capacity.
No, it's not. A non-power-of-two will lead to incorrect behavior, implementing the data structure the way the author describes, for precisely the reason given by your parent comment. There are other implementations that don't suffer from this (described in the comments there), but it's not simply a matter of replacing bitwise-and with a more generic computation of the modulus.
Imagine we had three-bit integers and a three-cell array. Stepping from 7 (= 1 mod 3) toward 8 (= 2 mod 3) winds up stepping instead to 0 (= 0 mod 3) because of overflow, which would reuse cells inappropriately.
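The same effect shows up with 8-bit indices and a three-cell array (a sketch; the index wraps at 256, and 256 mod 3 != 0):

```c
#include <stdint.h>

/* 255 % 3 == 0, so the next cell "should" be 1 -- but 255 + 1
   overflows an 8-bit index to 0, and 0 % 3 == 0: the buffer
   revisits cell 0 and skips cell 1 entirely. */
static inline uint8_t next_cell(uint8_t i)
{
    return (uint8_t)((uint8_t)(i + 1) % 3);
}
```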
Yes, any choice. Yes, that's not the case if you change the implementation in other ways in tandem - but then you are describing a different implementation. To mask where he does, any choice of mask will exhibit this problem (or other problems), mathematically. I can produce a proof if you need it.
> He only ever calls mask() after an increment.
I think you misunderstand what he's talking about. He calls mask at insert and lookup and he does not store the masked value - which leaves the implementation vulnerable to overflow (which is not a problem iff the array is 2^n big).
That said, one of the comments on the blog actually had a great suggestion for dealing with this. Wrap the value yourself at 2*capacity, you'll still get the same benefits from the algorithm and it's trivial to prevent the overflow then. You can then avoid the modulo operation (subtract the capacity if it's a larger index) and get better performing non-power of 2 capacities.
That said I'd mostly be using this kind of thing in a situation where i'd want to have a power of two sized buffer anyway.
Conceptually it's roughly what's going on anyway - something must wrap at some point if we're going to store our offsets in limited space - just that we get the wrapping for free from overflow if it's 2^n.
The confusion above stemmed, I think, from the fact that in the "original" implementation the mask is used for that wrapping and then we have a noop projection from offset to index. In this implementation, overflow is used for that wrapping and the mask is to project from offset to index. In the implementation you discuss, we pick still other functions for both.
fastest would likely be to wrap at the greatest multiple of N that fits in your integer representation, since that (probably dramatically) reduces branch mispredicts.
However, you're still doing lots of modulos to then find the "real" array index from the "virtual" one, so this is still likely not a great option compared to the power-of-two buffers.
It might actually be faster (at least for mildly large buffers) to use two ifs and a range that's twice the capacity. One if checks whether your virtual index needs to subtract the capacity to become real, and another checks whether it's time to subtract 2 times the capacity.
With the conditional subtraction recommended elsewhere, you can do 2*N with no branch mispredictions and no modulo. Whether that's actually faster should be tested, if you're in an environment where you care about such things.
You might have misread the OP. He's not asking why using & for the mask requires a 2^n sized buffer. He's asking why bother using & when it imposes these additional constraints on us. A part of the answer is that the constraints are already there with this approach, even if you pick a different function for masking than f(i,n) = i & (n - 1) -- which point OP only recently understood due to a misreading of the article.
Oh, come on, you don't need a branch to do a conditional subtraction. Reify the condition to 0/1 and use multiplication, or use AND with a two's complement of the condition.
OK, my bit twiddling knowledge is weak, my google skills are weaker still, and now I'm curious: what does "AND with a two's complement of the condition" mean, exactly?
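What the parent presumably means is the negate-and-mask idiom: in two's complement, negating a 0/1 condition gives an all-zeros or all-ones word, which you AND with the amount to subtract (a sketch; the name is illustrative):

```c
#include <stdint.h>

/* Branchless conditional subtraction: if x >= n, subtract n.
   (x >= n) evaluates to 0 or 1; -(uint32_t)1 is 0xFFFFFFFF,
   so the mask is either all ones (subtract n) or all zeros
   (subtract nothing). Assumes x < 2n on entry. */
static inline uint32_t wrap(uint32_t x, uint32_t n)
{
    return x - (n & -(uint32_t)(x >= n));
}
```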
I fully believe that there are plenty of contexts where that's true - particularly in any throughput oriented system where the buffer is large. But if you care, measure.
For what it's worth, Seymour Cray's Control Data 6600 implemented ring buffers for I/O channels correctly in hardware using gumdrop-sized single transistors in 1963. This is not exactly a new technology.
It's odd that you were using ring buffers in 1992 for low level code but don't understand the value of avoiding a modulus instruction. Masking is far more efficient and often a ring buffer will be used in code where performance is absolutely critical.
You wouldn't use the modulus operation. You aren't adding some arbitrary number that's going to increase either index by more than the buffer length, so you know that at worst you are going to need to subtract the length of the buffer.
IIRC the way we made this really fast was to write the buffer backwards. That way you can detect wrapping around the buffer because DEC will underflow and set the sign flag. Then you can JS (jump if sign) to whatever code needs to ADD back the buffer length to handle the wrap-around.
But 2^n has another problem (back in that era): buffer size. You are stuck with 1K, 2K, 4K, etc. buffers. When memory is tight you likely need something very specific, so you end up with the solution we had.
But, hey, if memory is free use 2^n bytes for your buffer.
You are still introducing a conditional by detecting the need to subtract, and iterating backward through memory is horrific for cache performance. If you need a specific, non power-of-2 sized buffer, then of course you make that design decision and pay the performance penalty. But I restate it's odd that you weren't even aware of the cost in 1992 as a system level programmer.
>iterating backward through memory is horrific for cache performance
This isn't true for Intel chips since the Netburst Pentium 4. The hardware prefetchers can handle predicting iteration through an array forwards, backwards, and even with strided accesses [0]. The array takes up the same number of cache lines in both cases, so going forwards or backwards incurs the same number of cache misses.
You don't need a conditional. You can set up a mask using sbb.
; precondition: 0 <= x <= N
; (N is a constant)
mov y, 0 ; y = 0
cmp x, N ; set carry flag iff x < N
sbb y, 0 ; y becomes -1 (all ones) if carry set, else stays 0
and x, y ; leaves x alone if x < N, sets x to zero if x == N
Have you tested this recently? I haven't for some years now but performance was identical regardless of direction. Maybe I need to try it again. I'd expect going backwards to be no worse than "not as good" - like say perhaps the prefetching mechanism doesn't cater for this case - but maybe my standards aren't high enough and this is enough to tip things over into the horrific.
I preach this knot to everyone I can. I'm a runner and a running coach. I've run literally thousands of miles (approaching 10,000 at this point) with this knot and it has NEVER come undone.
The really nice thing about this knot is that it looks really nice too so you can use them on both running shoes and dress shoes.
It makes no sense to teach the more common shoe tying knots.
"The finished "Ian Knot" is identical to either the Standard Shoelace Knot or the Two Loop Shoelace Knot. Because it was tied much more quickly and symmetrically, the laces suffer less wear and tear and thus last longer."
I've had shoelaces break maybe 3-4 times on shoes I wore regularly for more than 2-3 years. It's annoying out of all proportion to the expense involved.
(The plastic bits at the end can also get frayed and fall off, which happens more quickly, but I'm not sure knot style has much to do with that.)
I think it's so annoying because of the timing. I've never had one break when untying the knot or when just walking around. It's always while tying it which means I was just about to leave and now life has thrown a monkey wrench into my plans. Depending on how close I am cutting things, this may be an event that makes me late. Grrrrrr. Stupid shoelace!
Yes, this! And it happens particularly often when you use standard cotton laces, which make knots harder to accidentally undo. The synthetic ones last much longer, but are slippery and easy to untie.
If the plastic bit at the end falls off, just cut off the frayed part and dip the end into molten wax from a candle. I can't say I've tried it yet though, even that's so much trouble that I just live with the frayed end.
Heat-shrink tubing is perfect for replacing shoelace ends, if you happen to have some lying around (or have a friend who tinkers with electronics you can blag a bit off).
As a kid I wore canvas sneakers most of the time. I laced them every day. The shoes would outlast the laces even though eventually I would outgrow the shoes. Since I didn't have a personal assistant to get me new laces, I often had to tie the shoes differently so that the laces would still work in some fashion. On high top sneakers, sometimes I'd lace them approximately as low top sneakers, but with a really economical knot. The main point of wear was the point where the lace went through the top eyelets.
Yeap - a bit of my shoe lace broke just after tying them once on a work morning. I had to run to catch a bus, but instead I stood on my shoe lace mid stride, fell and slid across a petrol station driveway. Spent the bus ride trying not to bleed on the seats and had to apply disinfectant and remove stones from the flesh wound at work.
Anyway it was embarrassing but it taught me a valuable lesson - shoe laces can wear out and break.
Yeah, it happens to shoes that are kept a long time.
However, I remain unconvinced that the wear pattern matters this much. It seems to me that an alternative would be to re-lace your shoes every year, flipping sides. Then the pattern would be more even, too. And probably still a waste of time and effort.
No, the Ian's Secure Shoelace Knot is different. If you look at the very final photographs of both, you'll see that on the regular knot (tied either the standard way or the fast way) there's only a single vertical piece of lace right at the very front, but in the secure knot there are two.
Try the secure knot. It's basically like the bunny-ears way of tying a regular knot (where you hold both bunny-ears and slip one under the other), but you leave the hole open and slip the second back under the first as well.
It's much more secure than a double-knot, in my experience, and looks a lot nicer. But I still can't instinctually do it -- it takes me an extra couple of seconds each time.
If you pull the loops of a standard shoelace knot, you wind up with a square knot. Do the same with Ian's secure shoelace knot and you wind up with a surgeon's knot.
This is the double slip knot, I think. I have recently started using it (the standard knot is going loose too fast the way I wear my shoes) and will never go back to the standard knot. Just so good.
These look like variations on the square knot. How is he the inventor? I was looking at a field scout manual dated 1948 the other night where they have this same knot.
More importantly, make sure your starting knot and main knot are correct with respect to each other. When I learnt the Ian knot, I later learnt that I'd been tying my shoes using a "granny knot": http://www.fieggen.com/shoelace/grannyknot.htm If you are doing this, the easiest thing to do is reverse your starting knot; relearning the main knot is going to be much harder.
Since learning the Ian knot (and correct starting knot) I can honestly say I enjoy tying my shoes every day and relish the opportunity to tie a bow at any other time.
Wow, after reading that page I realized that I did this all throughout elementary school and middle school, which explains why my shoelaces would come undone all the time.
I use the "Two Loop Shoelace Knot Bad Technique 1" from that page.
In recent years I've been wearing shoes with a different fastening mechanism, but I have to tie some dress shoes for a wedding tomorrow, so this is very timely knowledge!
I wasn't given the attribution when I heard about this, it's nice to see the face behind it. Not sure I can even tie my shoes the old way anymore, I've never done it once since learning Ian's knot. Every once in a while someone observant will see me doing it and say, "whoa, WHAT?" I do get a kick out of telling people I re-learned how to tie my shoes on the internet.
It seems to be almost the same as the traditional method to me, since the crossing of the laces seems to cost about 50% of the time. It gets a lot better when you leave the crossing in. Edit: Apparently, there also is a routine to get the crossing in a nice, fast way, it just wasn't included in the pictures :)
More importantly, I can't seem to get a Ian's knot very tight. Does this get better over time?
I prefer the double slipknot (mentioned below as Ian's Secure Knot). I started doing it for basketball, but it not only looks better (more even) on normal and dress shoes, it also never comes off on its own (though it's easy to pull apart voluntarily). And it's not more complicated than a normal knot; it's more or less doing it "twice, in reverse".
So many people walk around assuming they need to do complicated double knots to stop their shoelaces untying themselves. If only they knew they were doing Granny Knots, and that a standard knot is perfectly secure if tied properly.
I love the way this discussion has divided neatly into thirds: history of ringbuffers; digression on shoelaces; fragmentary, widely ignored, replies about everything else (this one included, I'm sure).
I like this kind of article and enjoyed this particular one, but the long discussion above about the "right" way to do it goes some way to justifying why so many people are happy to do it the "wrong" way.
I've implemented and used ring buffers the "wrong" way many times (with the modulus operator as well!) and the limitations of this method have never been a problem or bottleneck for me, while its simplicity means that it's easier to write and understand than almost any other data structure.
In most practical applications, it's memory barriers that you really have to worry about.
I was waiting for someone to mention this -- it seemed much more interesting to me. It's a real classic in the "what the hell, you can do that?" category. (Bonus points if you've done it in a language that requires "extra data" for strings, like storing the length somewhere.)
I must admit that I never actually benchmarked my implementation properly -- it might be interesting to see if there are actual trade-offs between mmap vs. copying. (I'm guessing that nothing can beat MMU support, but I think the MMU also supports copy operations, so...?)
That's so cool. Unfortunately for me, the one time I could have used something like this, I was working on an embedded system with no mmap / virtual memory.
Note that wake_up() does not guarantee any sort of barrier unless something
is actually awakened. We therefore cannot rely on it for ordering. However,
there is always one element of the array left empty. Therefore, the
producer must produce two elements before it could possibly corrupt the
element currently being read by the consumer. Therefore, the unlock-lock
pair between consecutive invocations of the consumer provides the necessary
ordering between the read of the index indicating that the consumer has
vacated a given element and the write by the producer to that same element.
I have always considered these "double ring" buffers, along the same lines as how you figure out which race car in the race is in the lead by its position and lap count. You run your indexes in the range 0 .. (2 * SIZE).
Basically you're full if you're at the same relative index but on different laps, and you're empty if you're at the same relative index on the same lap. If you do this with a power-of-2 size then the 'lap' is just the bit with value SIZE.
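For a power-of-two SIZE that lap check might look like this (a sketch; the low bits are the position, and the SIZE bit is the lap parity):

```c
#include <stdbool.h>
#include <stdint.h>

#define SIZE 8u   /* power of two */

/* Indices run over [0, 2*SIZE). */
static inline uint32_t slot(uint32_t i) { return i & (SIZE - 1u); }
static inline uint32_t lap(uint32_t i)  { return i & SIZE; }

static inline bool rb_empty(uint32_t r, uint32_t w)
{
    /* same position, same lap */
    return slot(r) == slot(w) && lap(r) == lap(w);
}

static inline bool rb_full(uint32_t r, uint32_t w)
{
    /* same position, laps differ */
    return slot(r) == slot(w) && lap(r) != lap(w);
}
```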
No, I think the author is using the full range of a 32 bit int. So read could be any 32 bit integer, even if the size of the ring is 1.
(The trick is that SIZE has to be a power of two, or else when you increment from 2^32-1 to 0, your pointers will jump to a different position in the array.)
The first version always leaves a "clean" state, that is, both indices point to actual array locations. A mentally "clean" state makes understanding easier. For the third version one has to keep in mind the wrap-around behavior of fixed-width machine integers throughout the comprehension process, so it is a bit more difficult to understand.
The reasoning comes down to how you use it. I use ringbuffers for ultra low latency buffering of market data for instance. If my ringbuffer is so full that I'm worried about its length approaching its capacity then I'm doing something wrong and I should be willing to lose the data. 1 element isn't going to make the difference.
The real reason to stick with the first approach is that your static analysis tools won't freak out that you have intentional unsigned int overflow. Heck, some compilers will now scream at you for doing this. Then what happens when someone goes to port your code to a language with stricter overflow behavior? It won't work.
IMO even in realtime systems, I don't use this. Heck, the linux kernel even uses the original version.
And you don't have to expend any mental energy on the integer overflow edge case; it's handled by using a bitmask and a power-of-2 sized array.
Usually when I'm writing a ring buffer, it's for tasks where the loss of an item is acceptable (even desirable - a destructive ring buffer for debugging messages is a fantastic tool). As such, I simply push the read indicator when I get to the r=1, w=1 case.
Using the mask method is slick (I'd cache that mask with the array to reduce runtime calculations), but it's definitely going to add cognitive overhead and get messy if you want to make it lockless with CAS semantics.
In general, this makes sense; certainly data you're putting into a ring buffer is data you're willing to lose.
Doesn't it break the order invariant of the buffer, though? I can't see a way to do this without the risk of getting reads of newer data prior to older data. That's probably fine in many cases, but something like non-timestamped-debugging strikes me as a case where I'd want to know that the data arrived in the order I'm seeing.
> Doesn't it break the order invariant of the buffer, though
No, if you increment the read pointer prior to the write pointer, the read pointer will still point at the oldest valid value in the buffer.
So, in pseudo code (indices wrap mod the buffer size N):
if ((w + 1) mod N == r) {   // full: push the read pointer past the slot we need
    r = (w + 2) mod N
}
b[w] = value
w = (w + 1) mod N
For a debugging ring buffer (i.e. looking at it in a core file), you have the last value of the write pointer, so you can simply read from write pointer + 1 back around to the write pointer and have your messages in order. This makes the assumption that there is no readers of the debug buffer, so you're only having to deal with the one pointer.
From what I understand, this is the way you'd do it with hardware registers (maintain the read and write indices each with one extra MSB to detect the difference between full/empty).
We've been using similar code in PortAudio since the late 90s[0]. I'm pretty sure Phil Burk got the idea from his hardware work.
Probably it was a joke, though one can imagine the size being configurable, which surely would lead to interesting results if somebody set it to 1 for some reason (like troubleshooting).
It's indeed a ridiculous data structure, but I did actually need it.
It's a dynamically sized ring buffer with an optimization analogous to that of C++ strings; if the required capacity is small enough, the buffer is stored inline in the object rather than in a separate heap-allocated object. So something in the spirit of (but not exactly like):
struct rb {
    union {
        Value* array;
        // Set N such that this array uses the same amount of space as the pointer.
        Value inline_array[N];
    };
    uint16_t read;
    uint16_t write;
    uint16_t capacity;
};
You'd dynamically switch between the two internal representations, and choose whether to read from array or inline_array based on whether capacity is larger than N. In this setup it'd be pretty common for N to be 1. Having to add a special case to every single method would kind of suck, generic code that could handle any size seemed like a nice property to have.
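A hypothetical accessor for that layout might just branch on the capacity (a sketch in the spirit of the struct above; `Value`, `N`, and `rb_data` are illustrative names):

```c
#include <stdint.h>

typedef int Value;
/* N chosen so the inline array occupies the pointer's space. */
enum { N = sizeof(Value *) / sizeof(Value) };

struct rb {
    union {
        Value *array;           /* heap storage when capacity > N */
        Value inline_array[N];  /* small-buffer case */
    };
    uint16_t read, write, capacity;
};

/* The active representation is determined by the capacity alone. */
static inline Value *rb_data(struct rb *b)
{
    return b->capacity > N ? b->array : b->inline_array;
}
```

Every method then reads and writes through rb_data() and needs no per-size special cases.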
Weirdly, I think Haskell has an equivalent: MVar. It has its (low-level) uses, but it's quite hard to get any sort of non-trivial (non-rendezvous) synchronization protocol right. It's incredibly easy to deadlock. (But that may mostly be due to MVar's paucity of non-blocking primitives.)
I find the headline very interesting. It's very inviting because of the way it expresses a sort of epiphany about doing a mundane programming task wrong. One is tempted to read it to see if there is some great insight to this problem, and just maybe it's applicable outside this one problem. It raises the question: if he's been doing it wrong on a fairly mundane thing, maybe I am too. I need to see what this is about.
I believe it's very common to find little variations on algorithms or coding style like this that could produce a nice gain in efficiency or elegance. They aren't really the same problem as whole-system engineering, though, since most of your bottlenecks come from the algorithm that is completely unsuitable, not the one that is a little bit suboptimal.
I've always been doing it the "wrong" way, mostly on embedded systems. My classic application is a ring buffer for the received characters over a serial port. What's nice is that this sort of data structure doesn't need a mutex or such to protect access. Only the ISR changes the head, and only the main routine changes the tail.
Dumb question: why use power of two sized rings? If I know the reader won't be more than 100 behind the writer, isn't it better to waste one element of a 101 sized rings instead of 28 of a 128 sized ring?
His favored solution introduces subtlety and complexity. Remember that 20-year old binary search bug in the JDK a few years ago? That is the sort of bug that could be lurking in this solution.
I understand not wanting to waste one slot. A third variable (first, last, count) isn't too bad. But if you really hate that third variable, why not just use first and count variables? You can then compute last from first and count, and the two boundary cases show up as count = 0 and count = capacity.
The most common use for ring buffers is as the intermediary between a concurrent reader and writer (be it two threads, two processes sharing memory, or a software process communicating with hardware). And for that, the index + size representation is kind of miserable. Both the reader and the writer will be writing to the length field, which is bad for caching. The read index and the length will also need to always be read and updated atomically, which would be awkward.
No, the size of the array doesn't need to be a power-of-2 if you use modulus to derive indices. But you need to deal with the overflow somehow. For instance:
Modulus by a power of two is cheap. Modulus by a constant is a multiplication by reciprocal and a shift. And if your argument is in [0..2N], mod N is just a conditional subtraction that doesn't even require a branch.
Cheap is relative, right? I mean, a multiplication can be spread over shift and add/sub instructions, whereas a mask is just one instruction, right?
That's only true if your compiler actually outputs a modulus instruction when it sees you doing N % pow2. It really should optimize that into N & (pow2-1) for you, so whether you write the & or the % it will end up running the cheap & version.
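For unsigned operands the two spellings are exactly equivalent, which is why the compiler is free to pick the cheap one (a sketch; the function name is illustrative):

```c
#include <stdint.h>

/* For unsigned x and a power-of-two size, x % size == x & (size - 1),
   so either spelling can compile to a single AND instruction. */
static inline uint32_t mod_pow2(uint32_t x, uint32_t size)
{
    return x & (size - 1u);
}
```

Note this only holds for unsigned types; for signed values the % operator must preserve C's sign semantics, so the compiler emits a slightly longer sequence.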
> I must have written a dozen ring buffers over the years
Why would someone do this instead of re-using previous (or third-party) implementations? Of course unless it's all in different languages, but I don't think that's the case here.
Probably just because it doesn't add to the discussion. Though, from a certain standpoint it shows one of the problems with our education system pretty clearly. This is truly a fundamental technique. I don't know how one gets out of school without knowing it. It doesn't say anything about you, but it says a lot about what we are teaching people. Embarrassingly, for a long time I thought I had invented this technique ;-)
Don't worry. Programming and computer science is one of those things that anyone can learn on their own. If you don't mind some advice, though, try not to be embarrassed by things that you don't know. I can imagine that it is difficult, especially if you don't feel confident about your previous education. Even if most other people already know it, it just means that you have the pleasure of discovering it (as a certain XKCD comic pointed out).
One thing I've said to many people starting out (especially those without an academic background in the area) is that there is a lot to learn. Sometimes at the beginning, you improve so quickly that it is easy to think, "I must be getting close to knowing it all". After several decades in the industry, though, I'm still learning brand new (to me!) , important things every single day. In many ways, the best programmers are the ones who can see how much they don't know, not how much they do know.
I want you to know I genuinely appreciate you taking the time to write that. It's both helpful and uplifting. I've been going through a rough patch professionally and in life, and your comment lifted my spirits and brought me to tears (as absurd as I'm sure that must sound).
From one stranger on the internet to another : thank you.