As always, take those t/s stats with a huge boulder of salt. The demo shows a question "solved" in under 500 tokens. It's still amazing that this is possible at all, but you'll get nowhere near those speeds on real-world problems at the context lengths "thinking" models actually need (8-16k tokens). Even EPYCs with lots of memory channels drop to 2-4 t/s once the context grows past ~4096 tokens.
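
Rough intuition for why decode slows down with context, as a back-of-envelope sketch (all numbers below are illustrative assumptions, not measurements of any specific EPYC or model): CPU decode is memory-bandwidth-bound, and every generated token has to stream both the weights and the entire KV cache, so per-token cost grows linearly with context length.

    # Toy model of CPU decode speed vs. context length.
    # All constants are assumptions for illustration only:
    #   mem_bw_gbs         - assumed usable memory bandwidth (GB/s)
    #   weights_gb         - assumed weight bytes streamed per token (GB)
    #   kv_bytes_per_token - assumed KV-cache bytes read per context token

    mem_bw_gbs = 200           # assumed many-channel EPYC bandwidth
    weights_gb = 40            # assumed active weight bytes per token
    kv_bytes_per_token = 2e6   # assumed KV-cache bytes per context token

    def tokens_per_sec(context_len: int) -> float:
        # Per-token cost: stream all weights plus the whole KV cache once.
        bytes_per_token = weights_gb * 1e9 + kv_bytes_per_token * context_len
        return mem_bw_gbs * 1e9 / bytes_per_token

    for ctx in (512, 4096, 16384):
        print(f"context {ctx:>6}: ~{tokens_per_sec(ctx):.1f} t/s")

With these assumed numbers it prints roughly 4.9, 4.1, and 2.7 t/s, which is the same shape as what people report: fast on short demo prompts, then a steady decline as the KV cache reads start to dominate the bandwidth budget.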