I remember how we once ran into trouble with a large timestamp counter in an FPGA implementation. (Was it just 64 bits, or 112 bits? Probably the full PTP timestamp, including fractional nanoseconds for precise drift corrections.)
The extra bits of storage are cheap. The problem is the addition circuitry. With a small counter, you can do addition in a single clock cycle, which is very easy to implement. With a large counter, addition of the highest bit has to wait for completion of the previous one, etc., so if this takes longer than one clock cycle you have to implement a different, pipelined addition algorithm. (Or run everything at a lower clock frequency.)
It can be quite surprising how what might seem like minor differences to the programmer can require major changes to the hardware.
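For the curious, the pipelined-carry trick can be sketched in a few lines of Python. This is a toy model with made-up slice widths, not the actual FPGA design: the wide counter is split into small slices, and each slice adds the carry its neighbor produced on the *previous* cycle, so no carry chain ever spans more than one slice per clock.

```python
class PipelinedCounter:
    """Toy model of a wide counter split into chunk_bits-wide slices.

    Each tick, slice 0 adds 1 and every other slice adds the carry-out
    that the slice below it produced on the previous tick. No carry
    chain is ever longer than chunk_bits, at the cost of the upper
    slices lagging the lower ones by a cycle per slice boundary.
    """
    def __init__(self, slices=4, chunk_bits=4):
        self.chunk_bits = chunk_bits
        self.mask = (1 << chunk_bits) - 1
        self.slices = [0] * slices
        self.carries = [0] * slices  # registered carry out of each slice

    def tick(self):
        new_carries = [0] * len(self.slices)
        # slice 0 gets the increment; slice i gets slice i-1's old carry
        for i, add in enumerate([1] + self.carries[:-1]):
            total = self.slices[i] + add
            self.slices[i] = total & self.mask
            new_carries[i] = total >> self.chunk_bits
        self.carries = new_carries

    def value(self):
        # "Architectural" value: slice contents plus carries still in
        # flight (each carry belongs one slice higher than it came from).
        v = 0
        for i, s in enumerate(self.slices):
            v += s << (i * self.chunk_bits)
        for i, c in enumerate(self.carries):
            v += c << ((i + 1) * self.chunk_bits)
        return v
```

A reader that just sampled the raw slices would see a slightly skewed value; accounting for the in-flight carries (or reading twice and checking for consistency) recovers the true count.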
I saw an MITx class on edX called "Computation Structures" a few years ago and took it for fun. In the second part, students design a 32-bit RISC processor at the logic-gate level, though we were allowed to treat the register file and memory as black boxes.
I considered trying to actually build mine using good old classic 74xx-series logic. Mine would have needed 140 quad MUX2 chips, 75 quad AND2 chips, 81 dual MUX4 chips, 55 quad XOR2 chips, and 16 dual D flip-flop chips, plus a handful of others.
It was around 370 chips in total.
My design included a shift unit that could do left or right shifts, arithmetic or logical, of 1 to 31 bits in one clock cycle.
If I replaced that with a shift unit that could only do a 1-bit logical right shift, added a new instruction for it, made the old shift instructions trap, and emulated them in the trap handler, a whopping 88 of those 140 quad MUX2 chips would no longer be needed.
That would bring it down to around 280 chips. The fancy shifter was almost a quarter of the parts count!
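For a sense of where all those muxes go: a general 32-bit barrel shifter is typically log2(32) = 5 stages of 32 two-to-one muxes per direction. Here's a toy Python model of the logical-right-shift case; the counts it reports are illustrative of the structure, not my actual netlist:

```python
def barrel_shift_right(x, amount, width=32):
    """Logical right shift built from log2(width) mux stages.

    Stage k shifts by 2**k bits if bit k of `amount` is set. In
    hardware each stage is `width` 2:1 muxes, so a 32-bit shifter
    needs 5 * 32 = 160 mux bits (40 quad-MUX2 packages) for this
    one direction alone.
    """
    mask = (1 << width) - 1
    x &= mask
    mux_count = 0
    for k in range(width.bit_length() - 1):  # 5 stages for width=32
        shifted = (x >> (1 << k)) & mask
        # one 2:1 mux per bit picks shifted vs. unshifted
        if (amount >> k) & 1:
            x = shifted
        mux_count += width
    return x, mux_count
```

Supporting left shifts and arithmetic shifts as well multiplies the muxing further, which is how a full shifter ends up dominating the parts count.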
Naive question: do processors ever have sub-elements that run at a higher clock? I can imagine trying to hack around this by adding some sort of subprocessor structure that does addition for a particular set of registers at twice the normal speed (double-length registers?? I'm clearly spitballing). I guess it can't work because of memory bandwidth constraints?
> Do processors ever have sub-elements that run at a higher clock
Yes, this is called a "clock domain"; there may be quite a lot of them, and they can often be powered off individually.
> I can imagine trying to hack this sort of thing by putting some sort of subprocessor structure that does addition for a particular set of registers at twice normal speed
It's the other way round: a particular arrangement of logic elements, at a particular size and on a particular wafer process, at a particular temperature, will have a worst-case timing. That timing determines how fast you can possibly clock that part of the circuit.
Adders are annoying because of carry: you can't determine the top bit until you've determined the effect of carry on all the other bits. So if it takes, say, 250ps for the carry to propagate through your 32-bit adder, you can clock it at 4GHz. Widen it to 64 bits and propagation takes 500ps, and now you can only clock that part at 2GHz.
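For concreteness, here's a toy bit-serial Python model of a ripple-carry adder. The loop is the linear chain being described: each iteration stands for one full-adder stage whose carry-in comes from the stage below it.

```python
def ripple_add(a, b, width=32):
    """Model of a ripple-carry adder: the carry into bit i is not
    known until bit i-1 has been computed, so the critical path
    grows linearly with the width."""
    carry, result = 0, 0
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        # full adder: sum and carry-out for this bit position
        result |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (carry & (ai ^ bi))
    return result & ((1 << width) - 1)
```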
You may know this, but the person you responded to almost certainly doesn't, based on their question, so:
Carry-lookahead adders are a thing. The number of logic levels for computing the highest carry bit is logarithmic in the width of the numbers being added, not linear. Doubling the width of the numbers does not cut your clock rate in half, though you do pay for the faster cycle time in added area (more logic gates). All sorts of trade-offs are possible in the constant factors, but the standard adder designs have linear area in the number of bits, and logarithmic propagation time from inputs to outputs.
> the standard adder designs have linear area in the number of bits, and logarithmic propagation time from inputs to outputs.
It's a linear number of gates, but slightly worse than linear area due to routing. (I've looked for a strictly-linear design before, so if you have such a design, I'd quite appreciate a citation.) It's still much better than N log N area, though, so your point stands.
If you lay out a carry-lookahead tree the naive way, it occupies an N × log N rectangle with the N-bit adder along the N side, but most of the rectangle is empty (each layer has half as many gates as the previous). You can easily pack it more densely on an ad hoc basis, but I don't know of a systematic approach, so possibly it's only a (mildly large) constant factor better than the naive way at scale.
Normally you'll do the opposite: you'll have sub-parts which run at a lower clock rate. The 'core clock' is generally the fastest clock in the system, at least outside of specific high-speed transceivers. The most common approach is to pipeline the operation, which increases latency of the operation but still gives you the same throughput so long as the output does not need to feed back into the input within the operation time.
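A toy Python sketch of that trade-off: a 64-bit add split across two cycles, with the low half's carry held in a pipeline register. Latency becomes two cycles, but a new add can enter every cycle, so throughput is unchanged. This is illustrative, not any specific core's design:

```python
def pipelined_add64(pairs):
    """Two-stage pipelined 64-bit adder model: cycle 1 adds the low
    32 bits, cycle 2 adds the high 32 bits plus the registered carry.
    Accepts a stream of (a, b) pairs and returns their 64-bit sums."""
    MASK32 = 0xFFFFFFFF
    stage2 = None  # pipeline register between the two stages
    results = []
    for op in list(pairs) + [None]:  # one extra cycle to drain
        # stage 2: finish the add that entered last cycle
        if stage2 is not None:
            a_hi, b_hi, lo, carry = stage2
            hi = (a_hi + b_hi + carry) & MASK32
            results.append((hi << 32) | lo)
        # stage 1: accept a new add this cycle
        if op is not None:
            a, b = op
            lo_sum = (a & MASK32) + (b & MASK32)
            stage2 = ((a >> 32) & MASK32, (b >> 32) & MASK32,
                      lo_sum & MASK32, lo_sum >> 32)
        else:
            stage2 = None
    return results
```

The "feed back into the input" caveat shows up here too: if an add needs the result of the immediately preceding add, it has to wait out the two-cycle latency.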
To add onto this: you can easily increase the clock for specific clock domains by using a PLL (phase-locked loop) as a clock multiplier.
This comes with all kinds of caveats and complications (PLL clock multiplication is "fuzzy" and subject to phase drift and variation in the actual frequency, while ordinary clock division is essentially as exact as you can get). Because of this, as mentioned, it's preferable to use clock division rather than clock multiplication where possible.
> With a large counter, addition of the highest bit has to wait for completion of the previous one, etc., so if this takes longer than one clock cycle you have to implement a different, pipelined addition algorithm. (Or run everything at lower clock frequency.)
Kogge-Stone carry lookahead.
You can calculate the upper bits in parallel using Kogge-Stone (the predecessor of what GPU programmers call "prefix-sum" parallelism).
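A toy Python model of a Kogge-Stone adder, showing the parallel-prefix structure. Each bit position starts with a (generate, propagate) pair, and the scan combines pairs at doubling distances, so all carries are known after only log2(width) combining levels instead of a width-long ripple chain:

```python
def kogge_stone_add(a, b, width=32):
    """Kogge-Stone adder model: per-bit (generate, propagate) pairs
    combined with a parallel prefix scan. Returns (sum, levels) where
    `levels` is the number of combining levels used: log2(width)."""
    mask = (1 << width) - 1
    g = a & b & mask    # generate: this bit produces a carry
    p = (a ^ b) & mask  # propagate: this bit passes a carry along
    gp = [((g >> i) & 1, (p >> i) & 1) for i in range(width)]
    dist, levels = 1, 0
    while dist < width:
        new = list(gp)
        for i in range(dist, width):
            gi, pi = gp[i]
            gj, pj = gp[i - dist]  # pair from `dist` positions below
            # prefix combine: carry comes out if this group generates
            # one, or propagates the lower group's generated carry
            new[i] = (gi | (pi & gj), pi & pj)
        gp = new
        dist *= 2
        levels += 1
    # carry into bit i is the combined generate of bits 0..i-1
    carries = 0
    for i in range(1, width):
        carries |= gp[i - 1][0] << i
    return (a ^ b ^ carries) & mask, levels
```

Every combine at a given distance is independent of the others, which is exactly why the same pattern reappears as the prefix-sum primitive in GPU programming.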