In general yes, you can (and do) shard the model over multiple GPUs. If you want to do that yourself, look at DeepSpeed or FSDP. There is a communication overhead though, and the speed at which the GPUs can communicate is key. That's where NVLink comes in, btw.
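Just to make the idea concrete, here's a minimal sketch of sharding with PyTorch's FSDP (not a production recipe; the toy model, sizes, and hyperparameters are placeholders, and it assumes you launch it with torchrun so the process group env vars are set):

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def main():
    # NCCL collectives carry the inter-GPU traffic; NVLink is what makes them fast
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # stand-in for a real transformer
    model = nn.Sequential(
        nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # gathering full weights only around each layer's forward/backward pass
    sharded = FSDP(model)
    opt = torch.optim.AdamW(sharded.parameters(), lr=1e-4)

    x = torch.randn(8, 4096, device="cuda")
    loss = sharded(x).sum()
    loss.backward()  # reduce-scatter of gradients happens here -- that's the comms overhead
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

You'd run something like `torchrun --nproc_per_node=8 train.py` and each GPU only ever holds its shard plus whatever layer is currently being gathered.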
So yes, it's what you can and do do in practice. However, it limits how quickly you can iterate on the models, and from what I've read, a lot of the time the foundational labs throw out their models because by the time training finishes they're already outdated.