This is only available on certain architectures (x86 is one of them, as you've noted), and I've also seen scenarios where the atomic increment instruction is actually slower than using a mutex. Don't treat this feature as a magic bullet, and always test your use cases!
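If you want to test it yourself, here is a rough timing sketch in Rust. The function names `count_atomic` and `count_mutex` are made up for illustration, and the numbers you get will depend heavily on your hardware, thread count, and contention level, so treat it as a starting point rather than a proper benchmark.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::{Duration, Instant};

/// Increment a shared counter from `threads` threads with an atomic fetch-add.
/// Returns the final count and the elapsed wall-clock time.
fn count_atomic(threads: usize, per_thread: u64) -> (u64, Duration) {
    let counter = Arc::new(AtomicU64::new(0));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // One atomic read-modify-write per increment.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    (counter.load(Ordering::Relaxed), start.elapsed())
}

/// Same workload, but every increment takes and releases a mutex.
fn count_mutex(threads: usize, per_thread: u64) -> (u64, Duration) {
    let counter = Arc::new(Mutex::new(0u64));
    let start = Instant::now();
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Lock, increment, unlock (the guard drops at end of statement).
                    *counter.lock().unwrap() += 1;
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    (total, start.elapsed())
}
```

Call both with the same arguments, for example `count_atomic(8, 1_000_000)` and `count_mutex(8, 1_000_000)`, and compare the returned durations across a few runs and thread counts before drawing any conclusions.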
As for the parent discussion, "mutex" usually refers to the threading construct, while a "lock service" is how you'd refer to something like etcd or ZooKeeper.
Mutexes are typically implemented on top of atomic instructions. You'd do something like an atomic compare/exchange to acquire the mutex, and if there's no contention, you've got it. If there is contention, you fall back to the OS's synchronization constructs, which are typically much slower... An atomic increment should always be faster than acquiring a mutex and incrementing...
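To make the fast path concrete, here is a toy lock in Rust built on a single compare-exchange. `ToyLock` and `count_with_toy_lock` are hypothetical names for this sketch, not how any particular standard library implements its mutex. In particular, where a real mutex would ask the OS to put the thread to sleep under contention, this one just yields and retries.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical toy lock: shows the compare-exchange fast path only.
struct ToyLock {
    locked: AtomicBool,
}

impl ToyLock {
    fn new() -> Self {
        ToyLock { locked: AtomicBool::new(false) }
    }

    fn lock(&self) {
        // Uncontended case: a single compare-exchange flips false -> true.
        while self
            .locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            // Contended case: a real mutex would park the thread via the OS here.
            // Assumption for this sketch: we simply yield and try again.
            thread::yield_now();
        }
    }

    fn unlock(&self) {
        // Release so that writes made while holding the lock become visible
        // to the next thread that acquires it.
        self.locked.store(false, Ordering::Release);
    }
}

/// Increment a counter under the toy lock from several threads.
fn count_with_toy_lock(threads: usize, per_thread: u64) -> u64 {
    let lock = Arc::new(ToyLock::new());
    let counter = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    lock.lock();
                    // Separate load and store: only safe because the lock is held.
                    let v = counter.load(Ordering::Relaxed);
                    counter.store(v + 1, Ordering::Relaxed);
                    lock.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Note that even the uncontended path here costs one atomic compare-exchange plus a release store around the increment, whereas a plain atomic increment is a single atomic operation.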