> Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization
So a lot of that is overstated, IMO.
The N64 was in most cases the first system with a modern memory hierarchy game developers had come across. From my experiments, RDRAM is dozens of cycles away from the CPU at least, so it's pretty easy to be memory bound without coming close to saturating the ~200MB/sec or so of memory bandwidth the system can sustain. Remember too the context of the mid 90s, where the previous Nintendo console had single cycle access to main memory[0]. So you'd see crazy stuff like loops unrolled into 16KB of straight code with no branches despite the CPU only having a 16KB instruction cache, guaranteeing that you'd just flushed everything else out of it.
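To make that unrolling pattern concrete, here's a rough sketch (purely illustrative, not taken from any real game) of the kind of code I mean, next to the small loop that usually serves the N64 better:

```c
#include <stdint.h>

#define COUNT 4096

/* Old habit from consoles with single-cycle memory: unroll everything.
   On the VR4300 this produces ~16KB of straight-line code, enough to
   evict the entire 16KB instruction cache in one pass through it. */
void copy_unrolled(uint16_t *dst, const uint16_t *src) {
    dst[0] = src[0];
    dst[1] = src[1];
    /* ... thousands more lines like the two above ... */
    dst[COUNT - 1] = src[COUNT - 1];
}

/* A tight loop stays resident in the I-cache, and since the copy is bound
   by RDRAM latency anyway it's usually no slower, and everything else you
   call afterwards still hits the cache. */
void copy_looped(uint16_t *dst, const uint16_t *src) {
    for (int i = 0; i < COUNT; i++)
        dst[i] = src[i];
}
```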
Similarly, the GPU seemed really hampered by its small FIFO at basically the ROP stage, which meant the RMW (read-modify-write) of both the color and z buffers caused a lot of pipeline stalls. I think a lot of the benefit of switching to z sort late in the system's life had less to do with overall memory bandwidth, and more with the fact that you don't have to wait on memory just to then blit out the pixel. There's no z to check against, so you know that, yes, as soon as the pixel hits the ROP stage, the GPU can just write it to memory (assuming alpha isn't involved).
And that's not to shit on the devs in question; it wasn't until probably the last half of the PS2 era that the industry as a whole really internalized how to approach true long-tail memory hierarchies. I certainly hadn't, despite being able to regurgitate the textbook definitions involved.
I guess what I'm saying: hey demo scene coders, there's a bunch of untapped power in this system with weird hangups. : )
[0] Yes, the memory system of the SNES is hard to talk about in broad strokes like this with FastROM/SlowROM, wait states on cart mem accesses, etc, but 'single cycle' is the right order of magnitude for this discussion.
Maybe that’s right, and memory bandwidth is not the right explanation. But the rasterization itself is still often a big bottleneck, and maybe chalking it up to fill rate limitations is a better explanation. This is based on my limited observations writing my own N64 code and trying different scenes that put different amounts of load on the RDP; varying the load on the RDP was by far the easiest way to get framerate drops. If the CPU load is relatively constant and the RSP load is relatively constant, then explaining it as running into fill rate limitations seems like a good theory to me.
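For a rough sense of scale, here's a back-of-the-envelope sketch using the commonly quoted nominal numbers (62.5MHz RCP clock, 1 pixel per clock peak in 1-cycle mode, 2 clocks per pixel in 2-cycle mode). These are assumptions for illustration only; the real per-pixel cost with texturing, Z, and memory stalls is several times the nominal figure, so the practical number of full-screen passes you get per frame is much smaller than what this prints:

```c
#include <stdio.h>

/* How many 320x240 full-screen passes the RDP could fill per frame at 30fps
   if it actually hit its nominal peak rate. Real throughput is far lower. */
int main(void) {
    const double rdp_hz = 62.5e6;           /* nominal RCP clock             */
    const double pixels = 320.0 * 240.0;    /* one full-screen pass          */
    const double budget = rdp_hz / 30.0;    /* RDP cycles available / frame  */

    printf("peak passes per frame, 1-cycle mode: %.1f\n", budget / (pixels * 1.0));
    printf("peak passes per frame, 2-cycle mode: %.1f\n", budget / (pixels * 2.0));
    return 0;
}
```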
I’m a little skeptical that there’s “huge” untapped power in this system, given the performance of some later games like World Driver Championship (1999). There’s certainly a lot of processing power across the CPU, RSP, and RDP, but given how heterogeneous the system is, and how many weird hangups there are, I have doubts that we’re going to see something much better come out of the demoscene community. You need a lot of appetite for a long-term project in order to make something impressive on the N64, and while we have better emulators and compilers now, it’s hard to compete against someone in the 1990s who got to spend multiple years on the system full-time, with the support of a team and of the console’s developers.
There are some tricks I can imagine using, like spending more time with the RDP in single-cycle mode, or rendering just the fields to get 480i at the cost of 240p, but there are just so many thorny problems to deal with.
This is speaking as someone who participates both in the demoscene (I was just in Boston for @party), and N64 homebrew (you can find me on the Discord).
In one of Kaze's older videos [1], he tracks both CPU and RCP time over various optimisations. The numbers make it pretty apparent that you can get significant RDP performance gains by reducing the amount of memory bandwidth the CPU uses and/or improving its access locality.
The N64 is a unified memory system, and memory stalls triggered by the CPU will slow the RDP down.
The N64's memory controller appears to be very simple. As I understand it, if one sub-component attempts to access a DRAM row that is closed (DRAM rows are 2KB long), then the entire memory controller stalls for ~100ns as the RDRAM chip closes the previous row (writing out any dirty changes) and opens the new row.
The controller doesn't appear to do any reordering to optimise access patterns. If multiple components are accessing the same 1MB bank simultaneously, you can hit pathologically bad cases where the entire system slows down, as the memory controller is continually stalling for 100ns at a time while the memory chips close and reopen rows.
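A toy model of what that does to effective throughput, under the assumptions above (~100ns per row change, ~200MB/sec streaming rate when a row is already open, and an assumed 16-byte access size; all illustrative numbers, not measurements):

```c
#include <stdio.h>

/* Effective bandwidth as a function of how often an access lands in a
   closed row and eats the ~100ns open/close stall described above. */
int main(void) {
    const double stream_bw = 200e6;    /* assumed rate with the row already open */
    const double row_stall = 100e-9;   /* assumed stall per row change           */
    const double access    = 16.0;     /* assumed bytes moved per access         */

    for (int pct = 0; pct <= 100; pct += 25) {
        double t = access / stream_bw + (pct / 100.0) * row_stall;
        printf("row-miss rate %3d%% -> ~%5.1f MB/s effective\n",
               pct, (access / t) / 1e6);
    }
    return 0;
}
```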
Which is why some games can enable a high resolution mode when the memory expansion pack is present. They often don't need the extra 4MB of RAM, but simply being able to strategically spread their data across eight different banks instead of four can massively improve performance.
The bank organization is relatively well-known, so maybe you put the framebuffer in one bank, you put the zbuffer in another bank, and you have two banks left over without touching expansion memory. I mentioned World Driver Championship specifically because it runs at 640x480, and runs well, without the expansion pack.
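A sketch of that layout idea, assuming the ~1MB banks mentioned above; the addresses are hypothetical and made up for illustration, not taken from any particular game:

```c
#include <stdint.h>
#include <stdio.h>

#define BANK_SIZE 0x100000u                        /* ~1MB per bank, per the post above */
#define BANK_BASE(n) ((uint32_t)((n) * BANK_SIZE))

/* Hypothetical placement: 320x240 at 16bpp is 150KB, so each buffer fits
   easily inside its own bank, and the RDP's color and Z traffic stop
   closing each other's DRAM rows. */
static const uint32_t color_buffer_addr = BANK_BASE(1);  /* color buffer in bank 1 */
static const uint32_t z_buffer_addr     = BANK_BASE(2);  /* Z buffer in bank 2     */
/* Code, heap, display lists, audio buffers, etc. live in the remaining banks. */

int main(void) {
    printf("color buffer at 0x%08X, Z buffer at 0x%08X\n",
           (unsigned)color_buffer_addr, (unsigned)z_buffer_addr);
    return 0;
}
```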
Yes, memory access by the CPU will slow the RDP down. But the RDP is plenty slow even when the CPU is idle. The reason that we are seeing such improvements with SM64 is because SM64 was in such bad shape to begin with—something to be expected, given the novelty of 3D hardware in 1996 and problems with compiler bugs.
Kaze explained once that a big part of performance issues was related to fill rate which I guess is part of the memory bandwidth issue. Essentially, polygon counts aren't a huge issue provided they don't use a large amount of screen space. While the Z-buffer will help out with larger polygons, it isn't a silver bullet solution.
That said, I don't doubt SM64 was in bad shape; this was really Nintendo's first attempt at doing 3D at that scale and almost everything was experimental. But even with that, it is amazing to see just how well a lot of it works. Still, seeing some of the later stuff released on the N64, it did seem like a really tapped-out resource. Mind you, looking at Portal 64 coming along, it is neat to see that folks are still trying to push it just a little bit more.
Back to SM64, one thing I still find really cool is when you get Mario on a rotating platform: his position and facing angle are carried along with the platform's rotation, just as they should be. That is just slick to see on something that old.
Your example of World Driver Championship is important; it shipped a custom RSP microcode implementation to achieve that. The microcode implementations Nintendo distributed were pretty mediocre, either being very slow but accurate, or fast but too inaccurate, even for 3D games. Some of the most impressive games on the console were only achieved by writing custom microcode, which was excessively difficult because Nintendo refused to distribute resources to help do so, despite that being an intended use of the console's hardware.
Another huge problem in rasterization was that pretty much all graphics resources had to be in the 4KB texture cache to be rasterized, and unless you micromanaged that cache really effectively (and it was cut in half if you wanted certain RDP features!) it constantly had the wrong data in it, and this would slow everything down. If they had given it just a bit more cache for textures, it likely would have been much closer to its claimed peak performance in normal usage.
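Just to put numbers on how tight that is (simple arithmetic, assuming 16-bit texels):

```c
#include <stdio.h>

/* How many 16-bit texels fit in the texture cache, full and halved. */
int main(void) {
    const int full = 4096, half = 2048;   /* bytes */
    const int bpp  = 2;                   /* one 16-bit RGBA5551 texel */

    printf("4KB: %d texels (e.g. a 64x32 texture)\n", full / bpp);
    printf("2KB: %d texels (e.g. a 32x32 texture)\n", half / bpp);
    return 0;
}
```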
Isn't World Driver Championship fast because it doesn't use a Z-buffer in the first place, freeing half of the memory bandwidth normally used by the GPU? AFAIR their custom microcode sorted triangles back to front and just hoped you wouldn't notice glitches.
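For reference, the technique being described is essentially the textbook painter's algorithm; a minimal sketch of the idea (generic, not a claim about what World Driver Championship's microcode actually did) would be:

```c
#include <stdlib.h>

/* Back-to-front triangle ordering: sort by depth and draw in that order
   with Z compare/update disabled, accepting glitches where the ordering
   is ambiguous (intersecting or cyclically overlapping triangles). */
typedef struct {
    float depth;      /* e.g. centroid distance from the camera */
    /* vertex indices, material, etc. */
} Tri;

static int farther_first(const void *a, const void *b) {
    float za = ((const Tri *)a)->depth;
    float zb = ((const Tri *)b)->depth;
    return (za < zb) - (za > zb);   /* larger depth sorts earlier */
}

void draw_back_to_front(Tri *tris, size_t n) {
    qsort(tris, n, sizeof(Tri), farther_first);
    /* submit tris[0..n-1] to the RDP in this order, with the Z buffer off */
}
```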