The amount of memory you can put on a GPU is mainly constrained by the GPU's memory bus width (which is both expensive and power hungry to widen) and the available GDDR chips (each chip generally requires 32 bits of the bus). We've been using 16Gbit (2GB) chips for a while, and 24Gbit (3GB) GDDR7 modules are just starting to roll out, but they're expensive and in short supply. You also have to account for VRAM being somewhat power hungry (~1.5-2.5W per module under load).
Once you've filled all the slots, your only real option is a clamshell setup that doubles the VRAM capacity by putting chips on the back of the PCB in the same spots as the ones on the front (for timing reasons the traces all have to be the same length). Clamshell designs then need to figure out how to cool those chips on the back (~1.5-2.5W per module depending on speed and whether it's GDDR6/6X/7, meaning you could have up to 40W to dissipate on the back).
Some basic math puts us at 16 modules for a 512-bit bus (only the 5090; you have to go back a decade+ to find the previous 512-bit GPU), 12 with a 384-bit bus (4090, 7900 XTX), or 8 with a 256-bit bus (5080, 4080, 7800 XT).
A clamshell 5090 with 2GB modules tops out at 64GB, or 96GB with (currently expensive and scarce) 3GB modules (you'll be able to buy that at some point as the RTX 6000 Blackwell, at stupid prices).
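To make that arithmetic concrete, here's a back-of-the-envelope sketch (plain Python, using only the numbers above; the function name is just for illustration):

```python
# Rough VRAM ceiling: each GDDR module takes 32 bits of the bus, and a
# clamshell layout doubles the module count by mirroring chips on the back.
def max_vram_gb(bus_width_bits, module_gb, clamshell=False):
    modules = bus_width_bits // 32
    if clamshell:
        modules *= 2
    return modules * module_gb

print(max_vram_gb(512, 2))                  # 5090: 16 modules -> 32 GB
print(max_vram_gb(512, 2, clamshell=True))  # clamshell 5090 -> 64 GB
print(max_vram_gb(512, 3, clamshell=True))  # 3GB GDDR7 modules -> 96 GB
print(max_vram_gb(384, 2))                  # 4090 / 7900 XTX -> 24 GB
print(max_vram_gb(256, 2))                  # 5080 / 4080 / 7800 XT -> 16 GB
```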
HBM can get you higher capacities, but it's extremely expensive to buy (you're competing against H100s, MI300Xs, etc.), supply-limited (AI hardware companies are buying all of it and want even more), requires a different memory controller (meaning you'd still have to partially redesign the GPU), and requires expensive packaging to assemble.
What of previous generations of HBM? Older consumer AMD GPUs (Vega) and the Titan V had HBM2. According to https://en.wikipedia.org/wiki/Radeon_RX_Vega_series#Radeon_V... you could get 16GB with 1TB/s for $700 at release. It is no longer used in data centers. I'd gladly pay $2800 for 48GB with 4TB/s.
Interesting. So a 32-chip GDDR6 clamshell design could pack 64GB of VRAM with about 2TB/s on a 1024-bit bus, consuming around 100W for the memory subsystem? With current chip prices [1], this would apparently cost only about $200 (!) for the memory chips. So theoretically, it should be possible to build fairly powerful AI accelerators in the 300W and <$1000 range. If one wanted to, that is :)
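A quick sanity check of that arithmetic, treating each chip as driving its own 32-bit channel; the per-chip price and per-pin data rate below are assumptions for illustration, not figures from [1]:

```python
# Hypothetical 32-chip GDDR6 setup from the comment above.
chips = 32
chip_capacity_gb = 2            # 16Gbit GDDR6
pins_per_chip = 32
data_rate_gbps = 16             # per pin, typical for GDDR6 (assumed)
watts_per_chip = 2.5            # upper end of the ~1.5-2.5W range quoted earlier
price_per_chip_usd = 6          # assumed spot price, purely illustrative

bus_bits = chips * pins_per_chip               # 1024-bit bus
capacity_gb = chips * chip_capacity_gb         # 64 GB
bandwidth_gbs = bus_bits * data_rate_gbps / 8  # 2048 GB/s, i.e. ~2 TB/s
dram_power_w = chips * watts_per_chip          # ~80 W for the DRAM alone
dram_cost_usd = chips * price_per_chip_usd     # ~$190 in chips

print(capacity_gb, bandwidth_gbs, dram_power_w, dram_cost_usd)
```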
Hardware-wise, instead of putting the chips on the PCB surface, one would mount a 16-gonal arrangement of perpendicular daughterboards, each containing 2-16 GDDR chips where there would normally be one, with external liquid cooling, power delivery, and a PCIe control connection.
Each daughterboard would then feature a multiplexer with a dual-ported SRAM holding a table that stores, for each memory page, the chip number it maps to; the multiplexer would use that table to route requests from the GPU, while the second port would let the extra PCIe interface change the mapping.
API-wise, each resource would have N overlays, plus a new operation to switch the active overlay (which would require a custom driver that properly invalidates caches).
This would depend on the GPU tolerating the much higher latency of this setup, providing good enough support for cache flushing and invalidation, offering a deterministic mapping from physical addresses to chip addresses, and on all of this being manufacturable at a reasonable cost.
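Purely as an illustration of the routing idea (not a real driver or hardware interface; the page size, class, and method names are made up), a toy software model of that per-daughterboard mapping table might look like this:

```python
# Toy model of the proposed multiplexer: a dual-ported table maps each
# GPU-visible page to one of the chips behind the mux. Port 1 is consulted
# on every GPU request; port 2 lets the PCIe side channel rewrite entries.
PAGE_SIZE = 64 * 1024  # bytes per mapped page (arbitrary for this example)

class DaughterboardMux:
    def __init__(self, num_pages, num_chips):
        self.num_chips = num_chips
        self.page_to_chip = [0] * num_pages  # the "dual-ported SRAM" table

    def remap(self, page, chip):
        """Port 2 (PCIe side): switch which physical chip backs a page.
        The driver would have to flush/invalidate GPU caches around this."""
        assert 0 <= chip < self.num_chips
        self.page_to_chip[page] = chip

    def route(self, address):
        """Port 1 (GPU side): pick which chip serves this request.
        The address within the chip stays the same; only the chip changes."""
        page = address // PAGE_SIZE
        return self.page_to_chip[page], address

mux = DaughterboardMux(num_pages=1024, num_chips=8)
mux.remap(page=3, chip=5)               # "switch the overlay" for page 3
print(mux.route(3 * PAGE_SIZE + 128))   # -> (5, 196736)
```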
GPUs use special DRAM that has much higher bandwidth than the DRAM used with CPUs. The main reason they can achieve this higher bandwidth at low cost is that the connection between the GPU and the DRAM chip is point-to-point, very short, and very clean. Today, even clamshell memory configuration is not implemented by plugging two memory chips into the same bus; it's implemented by having the interface in the GDDR chips internally split into two halves, so each chip can either serve requests using both halves at the same time or using only one half over twice the time.
You are definitely not passing that link through some kind of daughterboard connector, or a flex cable.
To get 128GB of RAM on a GPU you'd need at least a 1024-bit bus. GDDR6X tops out at 16Gbit with a 32-pin interface, so you'd need 64 chips, and good luck even trying to fit those around the GPU die, since traces need to be the same length and you want to keep them as short as possible. There's also a good chance you can't run a clamshell setup, so you'd have to double the bus width to 2048 bits, because 32 GDDR6X chips would kick off way too much heat to be cooled on the back of a GPU. Such a ridiculous setup would obviously be extremely expensive and would use way too much power.
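Spelling out that math (same per-module power figure as earlier in the thread; this is just arithmetic, not a real board design):

```python
# What 128 GB of 16Gbit (2 GB) GDDR6X parts would take.
target_gb = 128
chip_gb = 2
pins_per_chip = 32
watts_per_chip = 2.5   # upper end of the earlier ~1.5-2.5W per-module figure

chips = target_gb // chip_gb                      # 64 chips
bus_without_clamshell = chips * pins_per_chip     # 2048-bit bus, all on the front
bus_with_clamshell = bus_without_clamshell // 2   # 1024-bit, 32 chips on the back
back_side_heat_w = (chips // 2) * watts_per_chip  # ~80 W to cool on the back

print(chips, bus_without_clamshell, bus_with_clamshell, back_side_heat_w)
```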
A more sensible alternative would be HBM, except good luck getting any capacity for that, since it's all being used for the extremely high-margin data center GPUs. HBM is also extremely expensive, both in terms of the cost of buying the chips and because of its advanced packaging requirements.
You do not need a 1024-bit bus to put 128GB of some DDR variant on a GPU. You could do a 512-bit bus with dual rank memory. The 3090 had a 384-bit bus with dual rank memory and going to 512-bit from that is not much of a leap.
This assumes you use 32Gbit chips, which will likely be available in the near future. Interestingly, the GDDR7 specification allows for 64Gbit chips:
> the GDDR7 standard officially adds support for 64Gbit DRAM devices, twice the 32Gbit max capacity of GDDR6/GDDR6X
Yeah, the idea that you're limited by bus width is kind of silly. If you're using ordinary DDR5 then consider that desktops can handle 192GB of memory with a 128-bit memory bus, implying that you get 576GB with a 384-bit bus and 768GB at 512-bit. That's before you even consider using registered memory, which is "more expensive" but not that much more expensive.
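The scaling in that example is just linear in bus width (the DDR5 desktop figure is the commenter's starting point; the ratio below is derived from it):

```python
# A desktop handling 192 GB over a 128-bit DDR5 bus works out to 1.5 GB per
# bus bit; wider buses scale linearly from there, before registered DIMMs.
desktop_gb, desktop_bus_bits = 192, 128
gb_per_bus_bit = desktop_gb / desktop_bus_bits   # 1.5

for bus_bits in (128, 384, 512):
    print(bus_bits, int(gb_per_bus_bit * bus_bits))  # 192, 576, 768 GB
```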
And if you want to have some real fun, cause "registered GDDR" to be a thing.
They're talking about the 16 sampled-texture binding limit, which is the same as WebGL2. If you look at, e.g., the list of devices that are stuck with that few texture bindings, they don't even support basic GL with compute shaders or Vulkan, so they can't run WebGPU in the first place.
Counterpoint: play any game without anticheat and realize just how much worse it can be. (And before someone says it: custom servers with dedicated admins don't scale and tend to cause lots of petty drama.)
Intel Arc GPUs are terrible for Nanite rendering, since they lack hardware support for both indirect draws (widely used in GPU-driven renderers; Intel emulates them in software, which is slow) and 64-bit atomics, which Nanite requires.
It looks like they're still actively maintaining a bunch of Rust crates, and they're still developing Wim (their blobby Roblox competitor written in Rust).
They have pulled back from the Rust ecosystem quite a bit, though, since repi (their former CTO) left shortly after The Finals released. They stopped all their FOSS sponsorships, and there was [this PR](https://github.com/EmbarkStudios/rust-ecosystem/commit/61f0e...) which definitely doesn't inspire confidence.
Other FPS game maps don't have licenses that let you use them to stress test your renderer or game engine. The existing freely available scenes are all too small and too poorly made to be proper stress tests on modern hardware (e.g. the old Sponza is way too light, the Intel Sponza just spams the subdivision modifier to get a stupidly high poly count, Bistro is small and really weirdly made, etc.).
VR needs a more powerful CPU and GPU than a MacBook Pro needs if you're trying to break into the VR gaming market (not saying you need a faster CPU than an M3, but VR is extremely demanding and will use all the CPU and GPU you can throw at it to maintain 90+fps with multiview rendering), especially since Apple keeps pushing Metal and not supporting Vulkan. (Metal tends to have higher CPU overhead vs Vulkan, and means you have to add an additional rendering backend for existing VR games)
> VR needs a more powerful CPU and GPU than a MacBook Pro needs if you're trying to break into the VR gaming market
No, it doesn't. The Quest series has done well for itself with boosted smartphone chips.
Sure, that means simpler graphics, but graphics aren't really the thing holding VR back right now. Ease of use, comfort, weight, eye strain, motion sickness, physical feedback: these are all bigger issues imo. (Though I'll admit that a larger FoV would help, and that's tied to graphical power.)
If you really want powerful VR, yes, but that is not what Apple is going for at all. They're only doing some very basic AR use cases with floating windows.