Gains by what metric? Are you sure you didn't trade in better latency for worse overall throughput? Also, sure you didn't hit one of many CFS overaccounting bugs which we've seen a few? Have you compared performance without the limit at all?
Previously we had no limit. We observed gains in both latency and throughput by implementing Automaxprocs and decided to roll it out widely.
This aligns with what others have reported on the Go runtime issue open for this.
"When go.uber.org/automaxprocs rolled out at Uber, the effect on containerized Go services was universally positive. At least at the time, CFS imposed such heavy penalties on Go binaries exceeding their CPU allotment that properly tuning GOMAXPROCS was a significant latency and throughput improvement."