The hotswap situation is unfortunate, and likely a result of server/consumer space differentiation among all kinds of vendors (OS, CPU, motherboard, and even the NVMe manufacturers all have to play along). But the limit of 4 is at least a real technical limitation: each of those add-in cards uses a PCIe x16 slot (either a real one or "DIMM.2"), and each drive on the card needs an x4 link from it. You could use a mux to add more, but the drives are already getting to the point of being able to saturate the links. PCIe 4.0 and 5.0 will give a lot of headroom for more drives on a system.
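Rough numbers for the "headroom" point, as a back-of-the-envelope sketch (assuming the usual ~0.985 GB/s of usable bandwidth per Gen3 lane after 128b/130b encoding, doubling each generation, and ignoring protocol overhead):

    # Approximate usable PCIe bandwidth per lane, in GB/s (back-of-the-envelope).
    GBS_PER_LANE = {"3.0": 0.985, "4.0": 1.969, "5.0": 3.938}

    def x4_drive_bandwidth(gen: str) -> float:
        """Bandwidth available to a single NVMe drive on an x4 link."""
        return 4 * GBS_PER_LANE[gen]

    for gen in GBS_PER_LANE:
        per_drive = x4_drive_bandwidth(gen)
        print(f"PCIe {gen}: ~{per_drive:.1f} GB/s per x4 drive, "
              f"~{4 * per_drive:.0f} GB/s for a 4-drive x16 card")

A Gen3 x4 link tops out around 3.9 GB/s, which current drives already brush up against; Gen4 and Gen5 roughly double that each step, which is where the headroom comes from.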
Sounds like we might need to go back to the kind of mainframe architecture that has IO offload. Split the PCIe bus into NUMA-like zones; give each zone its own (probably ARM) CPU, running its own kernel; then use "application processors" (probably x86) to command-and-control the IO zones, allocating e.g. IOMMU-subvirtualized ethernet channels to them. Control plane/data plane separation.
There's some (to me) interesting work in this area. See for example this talk[1], where they show how a RISC-V CPU with a narrow and slow PCIe link can orchestrate the direct transfer of data between two PCIe devices (say an NVMe drive and an Ethernet card), saturating the x16 link between them.
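The underlying trick, as I understand it (a hypothetical sketch, not the talk's actual code; every class and method name below is made up for illustration): the orchestrating CPU only pushes small command descriptors over its own narrow link, and points the data-pointer fields in those commands at the other device's PCIe memory (e.g. a controller memory buffer or a NIC BAR), so the bulk payload moves device-to-device and never touches the control CPU.

    # Hypothetical peer-to-peer DMA orchestration; these classes and methods
    # do not correspond to any real driver API.

    class NvmeQueue:
        def submit_read(self, lba: int, blocks: int, dest_bus_addr: int) -> None:
            """Enqueue a read whose data pointer is an arbitrary PCIe bus address."""
            ...

    class NicTxRing:
        def buffer_bus_addr(self, index: int) -> int:
            """Bus address of a NIC-resident packet buffer exposed through a BAR."""
            ...

        def send(self, index: int, length: int) -> None:
            ...

    def forward_blocks(nvme: NvmeQueue, nic: NicTxRing,
                       lba: int, blocks: int, buf_idx: int) -> None:
        # The control CPU writes a few dozen bytes of commands and doorbells;
        # the multi-KiB payload flows NVMe -> NIC directly across the switch.
        nvme.submit_read(lba, blocks, dest_bus_addr=nic.buffer_bus_addr(buf_idx))
        # ...wait for the NVMe completion (omitted), then hand the buffer to the NIC.
        nic.send(buf_idx, length=blocks * 4096)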
Sort of, but the network part of that ends up being a huge bottleneck too: with 16 drives at 5 GB/s each (the max I've seen so far), you need 80 GB/s of network bandwidth to each server. You start getting into the really expensive side of things speed-wise.
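To put that in perspective (simple arithmetic against nominal Ethernet line rates, ignoring protocol overhead):

    drives = 16
    per_drive_gbs = 5.0                            # GB/s sequential per drive
    aggregate_gbits = drives * per_drive_gbs * 8   # 80 GB/s = 640 Gbit/s

    # Nominal Ethernet line rates, no overhead accounted for.
    for name, gbit in [("100GbE", 100), ("200GbE", 200), ("400GbE", 400)]:
        print(f"{name}: {aggregate_gbits / gbit:.1f} links needed "
              f"for {aggregate_gbits:.0f} Gbit/s")

80 GB/s is 640 Gbit/s, i.e. more than six 100GbE ports or a couple of 400GbE ports per server before any overhead.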
Also, most CPUs you can buy have around 40-64 PCIe lanes, which at x4 per drive limits you to 10-16 drives at full speed (and that leaves no lanes left over for Ethernet).
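A quick lane-budget sketch (the x16 reserved for a NIC below is my own assumption, just to show how fast the budget disappears):

    def drives_at_full_speed(total_lanes: int, lanes_per_drive: int = 4,
                             reserved_lanes: int = 0) -> int:
        """How many x4 NVMe drives fit after reserving lanes for other devices."""
        return max(0, (total_lanes - reserved_lanes) // lanes_per_drive)

    for cpu_lanes in (40, 64):
        alone = drives_at_full_speed(cpu_lanes)
        with_nic = drives_at_full_speed(cpu_lanes, reserved_lanes=16)  # x16 NIC
        print(f"{cpu_lanes} lanes: {alone} drives alone, {with_nic} with an x16 NIC")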