> Games on the N64 are often limited by memory bandwidth, which is taken up by rasterization
So a lot of that is overstated, IMO.
The N64 was in most cases the first system with a modern memory hierarchy game developers had come across. From my experiments, RDRAM is dozens of cycles away from the CPU at least, so it's pretty easy to be memory bound without coming close to saturating the ~200MB/sec or so of memory bandwidth the system can sustain. Remember too the context of the mid 90s, where the previous Nintendo console had single cycle access to main memory[0]. So you'd see crazy stuff like loops unrolled into 16KB of straight code with no branches despite the CPU only having a 16KB instruction cache, guaranteeing that you'd just flushed everything else out of it.
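To make that unrolling pattern concrete, here's a rough sketch (purely illustrative, not taken from any real game) of the kind of code I mean, next to the small loop that usually serves the N64 better:

```c
#include <stdint.h>

#define COUNT 4096

/* Old habit from consoles with single-cycle memory: unroll everything.
   On the VR4300 this produces ~16KB of straight-line code, enough to
   evict the entire 16KB instruction cache in one pass through it. */
void copy_unrolled(uint16_t *dst, const uint16_t *src) {
    dst[0] = src[0];
    dst[1] = src[1];
    /* ... thousands more lines like the two above ... */
    dst[COUNT - 1] = src[COUNT - 1];
}

/* A tight loop stays resident in the I-cache, and since the copy is bound
   by RDRAM latency anyway it's usually no slower, and everything else you
   call afterwards still hits the cache. */
void copy_looped(uint16_t *dst, const uint16_t *src) {
    for (int i = 0; i < COUNT; i++)
        dst[i] = src[i];
}
```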
Similarly, the GPU seemed really hampered by its small FIFO at basically the ROP stage, which meant the RMW (read-modify-write) of both the color and z buffers caused a lot of pipeline stalls. I think a lot of the benefit of switching to z sort late in the system's life had less to do with overall memory bandwidth, and more with the fact that you don't have to wait on memory just to then blit out the pixel. There's no z to check against, so you know that, yes, as soon as the pixel hits the ROP stage, the GPU can just write it to memory (assuming alpha isn't involved).
And that's not to shit on the devs in question; it wasn't until probably the last half of the PS2 era that the industry as a whole really internalized how to approach true long-tail memory hierarchies. I certainly hadn't, despite being able to regurgitate the textbook definitions involved.
I guess what I'm saying: hey demo scene coders, there's a bunch of untapped power in this system with weird hangups. : )
[0] Yes, the memory system of the SNES is hard to talk about in broad strokes like this with FastROM/SlowROM, wait states on cart mem accesses, etc, but 'single cycle' is the right order of magnitude for this discussion.
Maybe that’s right, and memory bandwidth is not the right explanation. But the rasterization itself is still often a big bottleneck, and maybe chalking it up to fill rate limitations is a better explanation. This is based on my limited observations writing my own N64 code and trying different scenes that put different amounts of load on the RDP; varying the load on the RDP was by far the easiest way to get framerate drops. If the CPU load is relatively constant and the RSP load is relatively constant, then explaining it as running into fill rate limitations seems like a good theory to me.
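For a rough sense of scale, here's a back-of-the-envelope sketch using the commonly quoted nominal numbers (62.5MHz RCP clock, 1 pixel per clock peak in 1-cycle mode, 2 clocks per pixel in 2-cycle mode). These are assumptions for illustration only; the real per-pixel cost with texturing, Z, and memory stalls is several times the nominal figure, so the practical number of full-screen passes you get per frame is much smaller than what this prints:

```c
#include <stdio.h>

/* How many 320x240 full-screen passes the RDP could fill per frame at 30fps
   if it actually hit its nominal peak rate. Real throughput is far lower. */
int main(void) {
    const double rdp_hz = 62.5e6;           /* nominal RCP clock             */
    const double pixels = 320.0 * 240.0;    /* one full-screen pass          */
    const double budget = rdp_hz / 30.0;    /* RDP cycles available / frame  */

    printf("peak passes per frame, 1-cycle mode: %.1f\n", budget / (pixels * 1.0));
    printf("peak passes per frame, 2-cycle mode: %.1f\n", budget / (pixels * 2.0));
    return 0;
}
```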
I’m a little skeptical that there’s “huge” untapped power in this system, given the performance of some later games like World Driver Championship (1999). There’s certainly a lot of processing power across the CPU, RSP, and RDP, but given how heterogeneous the system is, and how many weird hangups there are, I have doubts that we’re going to see something much better come out of the demoscene community. You need a lot of appetite for a long-term project in order to make something impressive on the N64, and while we have better emulators and compilers now, it’s hard to compete against someone in the 1990s who got to spend multiple years on the system full-time, with the support of a team and of the console’s developers.
There are some tricks I can imagine using, like spending more time with the RDP in single-cycle mode, or rendering just the fields to get 480i at the cost of 240p, but there are just so many thorny problems to deal with.
This is speaking as someone who participates both in the demoscene (I was just in Boston for @party), and N64 homebrew (you can find me on the Discord).
In one of Kaze's older videos [1], he tracks both CPU and RCP time over various optimisations. The numbers make it pretty apparent that you can get significant RDP performance gains by reducing the amount of memory bandwidth the CPU uses and/or improving its access locality.
The N64 is a unified memory system, and memory stalls triggered by the CPU will slow the RDP down.
The N64's memory controller appears to be very simple. As I understand it, if one sub-component attempts to access a DRAM row that is closed (DRAM rows are 2KB long), then the entire memory controller stalls for ~100ns as the RDRAM chip closes the previous row (writing out any dirty changes) and opens the new row.
The controller doesn't appear to do any reordering to optimise access patterns. If multiple components are accessing the same 1MB bank simultaneously, you can hit pathologically bad cases where the entire system slows down, as the memory controller is continually stalling for 100ns at a time while the memory chips close and reopen rows.
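A toy model of what that does to effective throughput, under the assumptions above (~100ns per row change, ~200MB/sec streaming rate when a row is already open, and an assumed 16-byte access size; all illustrative numbers, not measurements):

```c
#include <stdio.h>

/* Effective bandwidth as a function of how often an access lands in a
   closed row and eats the ~100ns open/close stall described above. */
int main(void) {
    const double stream_bw = 200e6;    /* assumed rate with the row already open */
    const double row_stall = 100e-9;   /* assumed stall per row change           */
    const double access    = 16.0;     /* assumed bytes moved per access         */

    for (int pct = 0; pct <= 100; pct += 25) {
        double t = access / stream_bw + (pct / 100.0) * row_stall;
        printf("row-miss rate %3d%% -> ~%5.1f MB/s effective\n",
               pct, (access / t) / 1e6);
    }
    return 0;
}
```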
Which is why some games can enable a high resolution mode when the memory expansion pack is present. They often don't need the extra 4MB of RAM, but simply being able to strategically spread their data across eight different banks instead of four can massively improve performance.
The bank organization is relatively well-known, so maybe you put the framebuffer in one bank, you put the zbuffer in another bank, and you have two banks left over without touching expansion memory. I mentioned World Driver Championship specifically because it runs at 640x480, and runs well, without the expansion pack.
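A sketch of that layout idea, assuming the ~1MB banks mentioned above; the addresses are hypothetical and made up for illustration, not taken from any particular game:

```c
#include <stdint.h>
#include <stdio.h>

#define BANK_SIZE 0x100000u                        /* ~1MB per bank, per the post above */
#define BANK_BASE(n) ((uint32_t)((n) * BANK_SIZE))

/* Hypothetical placement: 320x240 at 16bpp is 150KB, so each buffer fits
   easily inside its own bank, and the RDP's color and Z traffic stop
   closing each other's DRAM rows. */
static const uint32_t color_buffer_addr = BANK_BASE(1);  /* color buffer in bank 1 */
static const uint32_t z_buffer_addr     = BANK_BASE(2);  /* Z buffer in bank 2     */
/* Code, heap, display lists, audio buffers, etc. live in the remaining banks. */

int main(void) {
    printf("color buffer at 0x%08X, Z buffer at 0x%08X\n",
           (unsigned)color_buffer_addr, (unsigned)z_buffer_addr);
    return 0;
}
```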
Yes, memory access by the CPU will slow the RDP down. But the RDP is plenty slow even when the CPU is idle. The reason that we are seeing such improvements with SM64 is because SM64 was in such bad shape to begin with—something to be expected, given the novelty of 3D hardware in 1996 and problems with compiler bugs.
Kaze explained once that a big part of performance issues was related to fill rate which I guess is part of the memory bandwidth issue. Essentially, polygon counts aren't a huge issue provided they don't use a large amount of screen space. While the Z-buffer will help out with larger polygons, it isn't a silver bullet solution.
That said, I don't doubt SM64 was in bad shape; this was really Nintendo's first attempt at doing 3D at that scale and almost everything was experimental. But even with that, it is amazing to see just how well a lot of it works. Still, seeing some of the later stuff released on the N64, it did seem like a really tapped-out resource. Mind you, looking at Portal 64 coming along, it is neat to see that folks are still trying to push it just a little bit more.
Back to SM64, one thing I still find really cool is when you get Mario on a rotating platform: his position and facing angle are carried along with the platform's rotation, just as they should be. That is just slick to see on something that old.
Your example of World Driver Championship is important; it shipped a custom RSP microcode implementation to achieve that. The microcode implementations Nintendo distributed were pretty mediocre, either being very slow but accurate, or fast but too inaccurate, even for 3D games. Some of the most impressive games on the console were only achieved by writing custom microcode, which was excessively difficult because Nintendo refused to distribute resources to help do so, despite that being an intended use of the console's hardware.
Another huge problem in rasterization was that pretty much all graphics resources had to be in the 4KB texture cache to be rasterized, and unless you micromanaged that cache really effectively (and it was cut in half if you wanted certain RDP features!) it constantly had the wrong data in it, and this would slow everything down. If they had given it just a bit more cache for textures, it likely would have been much closer to its claimed peak performance in normal usage.
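Just to put numbers on how tight that is (simple arithmetic, assuming 16-bit texels):

```c
#include <stdio.h>

/* How many 16-bit texels fit in the texture cache, full and halved. */
int main(void) {
    const int full = 4096, half = 2048;   /* bytes */
    const int bpp  = 2;                   /* one 16-bit RGBA5551 texel */

    printf("4KB: %d texels (e.g. a 64x32 texture)\n", full / bpp);
    printf("2KB: %d texels (e.g. a 32x32 texture)\n", half / bpp);
    return 0;
}
```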
Isn't World Driver Championship fast because it doesn't use a Z-buffer in the first place, freeing half of the memory bandwidth normally used by the GPU? AFAIR their custom microcode sorted triangles back to front and just hoped you wouldn't notice glitches.
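For reference, the technique being described is essentially the textbook painter's algorithm; a minimal sketch of the idea (generic, not a claim about what World Driver Championship's microcode actually did) would be:

```c
#include <stdlib.h>

/* Back-to-front triangle ordering: sort by depth and draw in that order
   with Z compare/update disabled, accepting glitches where the ordering
   is ambiguous (intersecting or cyclically overlapping triangles). */
typedef struct {
    float depth;      /* e.g. centroid distance from the camera */
    /* vertex indices, material, etc. */
} Tri;

static int farther_first(const void *a, const void *b) {
    float za = ((const Tri *)a)->depth;
    float zb = ((const Tri *)b)->depth;
    return (za < zb) - (za > zb);   /* larger depth sorts earlier */
}

void draw_back_to_front(Tri *tris, size_t n) {
    qsort(tris, n, sizeof(Tri), farther_first);
    /* submit tris[0..n-1] to the RDP in this order, with the Z buffer off */
}
```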