One of the benefits of the original algorithm is the independence of the read and write indexes: they can be updated from different threads (or different processors!) without any atomic operations beyond writing or reading a value. Subtracting from both pointers, by contrast, requires an additional atomic read-modify-write operation.
You could also just restrict the pointers in the normal way, but to twice the size of the buffer: instead of wrapping at N, you wrap at 2*N.
You are only encoding one extra bit of information (first or second pass through the buffer), so letting unsigned integer overflow carry more than that is just an optimization, not fundamentally necessary.
If you do that, the size() function becomes a problem. The original implementation relies on unsigned integer wrap-around to give the proper result when write < read; with indexes that wrap at 2*N, you have to take the difference modulo 2*N instead.
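A minimal sketch of that variant, assuming a single-threaded ring buffer with indexes that run from 0 to 2*N-1 (all names here are illustrative, not from any particular library):

```cpp
#include <cstddef>

// Ring buffer whose indexes wrap at 2*N instead of relying on
// unsigned integer overflow. The physical slot is index % N; the
// extra range distinguishes "full" from "empty".
template <typename T, std::size_t N>
class RingBuffer {
    T buf_[N];
    std::size_t read_ = 0;   // logical index, 0 .. 2*N-1
    std::size_t write_ = 0;  // logical index, 0 .. 2*N-1

    static std::size_t next(std::size_t i) { return (i + 1) % (2 * N); }

public:
    // Correct even when write_ < read_ numerically, because the
    // difference is taken modulo 2*N rather than via unsigned overflow.
    std::size_t size() const { return (write_ + 2 * N - read_) % (2 * N); }

    bool empty() const { return read_ == write_; }
    bool full()  const { return size() == N; }

    bool push(const T& v) {
        if (full()) return false;
        buf_[write_ % N] = v;   // physical slot is index mod N
        write_ = next(write_);
        return true;
    }

    bool pop(T& out) {
        if (empty()) return false;
        out = buf_[read_ % N];
        read_ = next(read_);
        return true;
    }
};
```

Note that size() works even after the write index wraps past 2*N and ends up numerically below the read index, which is exactly the case the unsigned-overflow trick handled in the original.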
It all depends on your use case. In my experience, most of the time when dealing with ring buffers, it's more important to guarantee that the read index and the write index can be updated lock-free by different threads (e.g. one producer and one consumer) than to have a very specific non-power-of-2 capacity. Still, if you do need an arbitrary capacity, there are workarounds:
- subtracting the buffer size from both pointers once the read pointer has wrapped
- choosing a wider integer type for the index arithmetic where possible
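The second workaround can be sketched as follows: with 64-bit free-running counters the indexes never wrap in practice, so size() is a plain subtraction and the capacity can be any value, not just a power of two. This is a sketch under those assumptions, not a production implementation; all names are illustrative.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// SPSC queue with free-running 64-bit counters: push() is called only by
// the producer thread, pop() only by the consumer thread.
template <typename T, std::size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<std::uint64_t> read_{0};
    std::atomic<std::uint64_t> write_{0};

public:
    std::size_t size() const {
        // Load read_ first: it only increases, so a stale value can only
        // overestimate the size, never make the subtraction underflow.
        std::uint64_t r = read_.load(std::memory_order_acquire);
        std::uint64_t w = write_.load(std::memory_order_acquire);
        return static_cast<std::size_t>(w - r);
    }

    bool push(const T& v) {  // producer thread only
        std::uint64_t w = write_.load(std::memory_order_relaxed);
        if (w - read_.load(std::memory_order_acquire) == N) return false;
        buf_[w % N] = v;  // modulo, not a bit-mask: N need not be a power of 2
        write_.store(w + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& out) {  // consumer thread only
        std::uint64_t r = read_.load(std::memory_order_relaxed);
        if (write_.load(std::memory_order_acquire) == r) return false;
        out = buf_[r % N];
        read_.store(r + 1, std::memory_order_release);
        return true;
    }
};
```

Each index is only ever written by one thread, so plain atomic loads and stores with acquire/release ordering suffice; no read-modify-write operations are needed.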
That seems like a small price to pay for the freedom to choose an appropriate buffer size.