The other thing is that nVidia try to sell GPUs with similar performance at two very different prices: one price for data centres and a quite different price to kids. If you do the job yourself you can often get away with the much cheaper gamer-grade cards for AI work (unless you need a lot of VRAM), whereas the likes of AWS can't do that and are required by nVidia to use the considerably more expensive data-centre cards. If your workload fits on a gamer-grade card, there's no contest on price between an on-prem system and the cloud.
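As a rough illustration of the "no contest" claim, here's a back-of-envelope break-even calculation. The card price, electricity cost, and cloud hourly rate below are placeholder assumptions for the sake of the arithmetic, not quotes for any real card or provider.

```python
# Back-of-envelope: when does owning a consumer card beat renting in the cloud?
# All figures are hypothetical placeholders, not real prices.
card_cost = 1500.0          # assumed one-off price of a gamer-grade card, in dollars
power_cost_per_hour = 0.10  # assumed electricity cost while the card is busy
cloud_rate_per_hour = 1.30  # assumed hourly rate for a comparable cloud GPU instance

# Hours of GPU time after which buying is cheaper than renting.
break_even_hours = card_cost / (cloud_rate_per_hour - power_cost_per_hour)
print(f"Owning pays off after ~{break_even_hours:.0f} GPU-hours "
      f"(~{break_even_hours / 24:.0f} days of continuous use)")
```

With those assumed numbers the card pays for itself in a couple of months of steady use; the gap only widens the longer the workload runs.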
That is a really good point, and the 3090s have a surprising amount of VRAM on them; for many smaller models it's sufficient. However, where I work (without going into a lot of specifics), the size of the models makes the amount of VRAM crucial, along with the PCIe lanes feeding each card, the speed of the local storage, and the networking between cards on the same node as well as between nodes.
The moment the model gets bigger than any one GPU's VRAM, the difficulty of training it goes up by orders of magnitude.
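To make that concrete, here's a rough sketch of the arithmetic, assuming fp16 weights with Adam-style optimizer state; the per-parameter byte counts are the usual rule-of-thumb figures, not measurements from any particular setup.

```python
# Rough VRAM estimate for training: does the model fit on one card?
# Rule-of-thumb assumption: ~16 bytes per parameter for training
# (2 fp16 weights + 2 gradients + ~12 for fp32 master weights and Adam moments).
def training_vram_gb(params_billions: float, bytes_per_param: float = 16.0) -> float:
    return params_billions * bytes_per_param  # gigabytes; activations excluded

for params in (7, 13, 70):
    need = training_vram_gb(params)
    verdict = "fits" if need <= 24 else "does NOT fit"
    print(f"{params}B params -> ~{need:.0f} GB for weights/optimizer; {verdict} on a 24 GB 3090")
```

Once that estimate exceeds one card's VRAM you're into tensor or pipeline parallelism, which is exactly where the interconnect and storage bottlenecks mentioned above start to dominate.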