You get a fast link between the GPUs, which should help when you’ve got a model split between them.
However, that split isn’t automatic. You can’t expect to just run a 40GB model across the two cards unless the software has been written to shard it, the way llama.cpp can split a model between the GPU and CPU, for instance.
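For example, with the llama-cpp-python bindings you can tell it how to spread a model over two cards. A minimal sketch, assuming the model path and split ratios are placeholders and the parameter names are as I remember them from that library:

```python
from llama_cpp import Llama

# Offload every layer to GPU and split tensors roughly evenly
# across two devices (path and ratios are placeholders).
llm = Llama(
    model_path="models/model-40b.gguf",   # hypothetical 40GB-class GGUF file
    n_gpu_layers=-1,          # -1 = offload all layers to the GPUs
    tensor_split=[0.5, 0.5],  # fraction of the model to place on each GPU
)

out = llm("Q: What does NVLink give you? A:", max_tokens=64)
print(out["choices"][0]["text"])
```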
What you can do without trouble is keep more models loaded, do more things at the same time, and sometimes run the same model at close to double the throughput if the workload batches well.
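The "same model, two copies, split the batch" version is pretty simple in PyTorch. A rough sketch with a toy model (in practice you’d load the same checkpoint into both replicas and use a real serving stack for batching):

```python
import torch

# Toy stand-in for a real model; in practice load the same
# checkpoint into both replicas.
def make_model():
    return torch.nn.Linear(512, 512)

model0 = make_model().to("cuda:0").eval()
model1 = make_model().to("cuda:1").eval()

# One incoming batch, half handled by each GPU.
batch = torch.randn(32, 512)
with torch.no_grad():
    out0 = model0(batch[:16].to("cuda:0"))
    out1 = model1(batch[16:].to("cuda:1"))
result = torch.cat([out0.cpu(), out1.cpu()])
```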
CUDA multi-GPU with NVLink is well supported, including peer-to-peer access between the cards’ memory. You still want NCCL to handle the inter-GPU communication efficiently, but most CUDA-aware libraries (and the ML frameworks built on top of them) can take advantage of it.
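You can sanity-check the peer-to-peer path from PyTorch. A quick sketch; a device-to-device copy like this rides the peer path (NVLink when the bridge is present) if peer access is available, otherwise it bounces through host memory:

```python
import torch

# True if GPU 0 can address GPU 1's memory directly (peer access,
# which goes over NVLink when the bridge is present and enabled).
print(torch.cuda.can_device_access_peer(0, 1))

# A direct device-to-device copy; with peer access it stays on the
# peer path instead of staging through the host.
x = torch.randn(4096, 4096, device="cuda:0")
y = x.to("cuda:1")
torch.cuda.synchronize()
```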
I think you need enterprise-grade cards to make that work. If I remember correctly, consumer cards with NVLink can’t pool their VRAM into one memory space to host a 40GB model.