I find this unlikely. Do you have any real-world evidence for this? In my experience, microbenchmarks are not at all useful for this stuff. As for speed: in the uncontended case, which is the one we mostly care about (once you're contended, it's game over for speed anyway), both are a single atomic CAS, so there's no difference.
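To make the "single atomic CAS" point concrete, here is a minimal C11 sketch of a hypothetical one-word lock. `tiny_lock` and its functions are illustrative names, not any real API. The uncontended acquire is one compare-and-swap from 0 to 1. The contended path uses `sched_yield()` only as a portable stand-in: a real implementation would park the thread with a futex or `WaitOnAddress` and track waiters so that release can wake one, and this sketch does neither.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal lock: one 32-bit word, 0 = unlocked, 1 = locked. */
typedef struct {
    _Atomic uint32_t state;
} tiny_lock;

#define TINY_LOCK_INIT { 0 }

/* Fast path: a single compare-and-swap from 0 (unlocked) to 1 (locked). */
static bool tiny_lock_try(tiny_lock *l) {
    uint32_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->state, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void tiny_lock_acquire(tiny_lock *l) {
    /* Uncontended case: the first CAS succeeds and we are done. */
    while (!tiny_lock_try(l)) {
        /* Contended case. A real lock would block here via futex or
           WaitOnAddress; sched_yield() is only a stand-in for this sketch. */
        sched_yield();
    }
}

static void tiny_lock_release(tiny_lock *l) {
    /* A real futex-style lock would also check for waiters and wake one. */
    uint32_t prev = atomic_exchange_explicit(&l->state, 0, memory_order_release);
    assert(prev == 1 && "released a lock that was not held");
    (void)prev;
}
```

In the uncontended case, acquire and release each cost one atomic read-modify-write on the lock word. Everything beyond that only matters once threads actually collide.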
In terms of size, the pointer is bigger than you'd want. It's not pthread_mutex or the analogous Windows data structure, which is a multi-cache-line disaster, but it's clearly worse than a futex or WaitOnAddress solution.
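To check the size argument on a given platform, the sketch below prints the size of a pointer-sized lock word (the layout of Windows' `SRWLOCK`, which wraps a single pointer), a 32-bit futex-style word, and `pthread_mutex_t`. The two struct types and `print_lock_sizes` are hypothetical stand-ins for illustration; the printed values depend on the platform and ABI, so none are asserted here. Note that a Linux futex always operates on a 32-bit word, while `WaitOnAddress` accepts addresses of 1, 2, 4, or 8 bytes.

```c
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layouts, used only to compare sizes. */
typedef struct { uint32_t word; } futex_style_lock;   /* 32-bit word, as futex requires */
typedef struct { void *ptr; } pointer_sized_lock;     /* one pointer, like SRWLOCK */

static void print_lock_sizes(void) {
    printf("futex-style lock:   %zu bytes\n", sizeof(futex_style_lock));
    printf("pointer-sized lock: %zu bytes\n", sizeof(pointer_sized_lock));
    printf("pthread_mutex_t:    %zu bytes\n", sizeof(pthread_mutex_t));
}
```

On a typical 64-bit target the pointer-sized lock is twice the size of the 32-bit word, which is the gap the comment above refers to.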
A few years ago I compared them, not in a microbenchmark but in a real application. It took a few million (almost entirely uncontended) exclusive locks on startup, and SRWLock was consistently faster, though the difference was not large.