I find their explanation of the pipelining issue slightly confusing, probably because they tried to simplify it to the extreme:
>One could object to this and note that due to shorter clock ticks, the small steps will be executed faster, so the average speed will be greater. However, the following diagram shows that this is not the case.
Said diagram shows that the two-clock-tick step locks the pipeline, that is, you can't execute the first clock tick of the next instruction if you're still running the 2nd part of the previous one. When would this be the case? Isn't the entire point of pipelining to divide a function into smaller steps that can be run in parallel? If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?
I suppose that eventually you run into the issue that adding more pipeline stages increases the logic size, which in turn makes it run slower, or something like that. I wish the document were a little more specific; after all, it doesn't hesitate to throw in the physical formulas for power dissipation in the 2nd part, so it's clearly not afraid to dig into technical details.
> If you can split "step 3" across two clock cycles, couldn't you effectively subdivide it into two steps that could run in parallel?
There is a difference between splitting "step 3" across two clock cycles and splitting "step 3" into two separate steps. The underlying assumption here is that "step 3" is indivisible. E.g. say "step 3" is a memory access with a latency of 500 picoseconds; it's not like you can just split it into two steps and make it load faster.
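The effect on throughput can be sketched with a toy model (a hypothetical `pipeline_ticks` helper; it assumes a simple in-order pipeline with no buffering between stages, so an instruction can only enter a stage once the previous instruction has left it):

```python
def pipeline_ticks(stage_latencies, n_instructions):
    """Ticks to retire n_instructions through a simple in-order pipeline.

    A stage that takes k ticks blocks the instruction behind it, so in
    steady state one instruction completes per max(stage_latencies) ticks.
    """
    fill = sum(stage_latencies)        # first instruction traverses every stage
    bottleneck = max(stage_latencies)  # later ones are spaced by the slowest stage
    return fill + (n_instructions - 1) * bottleneck

# Four 1-tick stages: one instruction retires per tick in steady state.
print(pipeline_ticks([1, 1, 1, 1], 100))  # 4 + 99*1 = 103
# Same pipeline, but "step 3" takes 2 ticks and stalls everything behind it.
print(pipeline_ticks([1, 1, 2, 1], 100))  # 5 + 99*2 = 203
```

Under this (admittedly simplified) model, making the clock twice as fast doesn't help if one indivisible stage still spans two of the shorter ticks: the slowest stage sets the throughput, not the tick length.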