TPUs have a network topology better suited for long context than gpus: https://j...

TPUs have a network topology better suited for long context than gpus: https://jax-ml.github.io/scaling-book/tpus/#tpu-networking

> This nearest-neighbor connectivity is a key difference between TPUs and GPUs. GPUs connect up to 256 H100s in an all-to-all configuration (called a node), rather than using local connections. On the one hand, that means GPUs can send arbitrary data within a node in a single low-latency hop. On the other hand, TPUs are dramatically cheaper and simpler to wire together, and can scale to much larger topologies because the number of links per device is constant.