Is using pthread_key_create / pthread_setspecific / pthread_getspecific / pthread_key_delete better, worse, or the same? As I understand it, this is the low-level API you were looking for.
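For anyone who hasn't used that API, here is a minimal sketch of how the calls usually fit together. The names counter_key, get_counter, worker and run_worker are made up for illustration and don't come from the article; the int counter just stands in for whatever per-thread state you need. This shows only the shape of the API and says nothing about how fast it is.

```c
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical names for illustration only. */
static pthread_key_t counter_key;
static pthread_once_t counter_key_once = PTHREAD_ONCE_INIT;

/* Destructor: called at thread exit with the thread's non-NULL value. */
static void free_counter(void *p) { free(p); }

/* Create the key exactly once per process. */
static void make_counter_key(void) {
    if (pthread_key_create(&counter_key, free_counter) != 0) abort();
}

/* Return this thread's counter, allocating it (zeroed) on first use. */
static int *get_counter(void) {
    pthread_once(&counter_key_once, make_counter_key);
    int *p = pthread_getspecific(counter_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        if (p == NULL) abort();
        if (pthread_setspecific(counter_key, p) != 0) abort();
    }
    return p;
}

/* Thread body: bump the thread's own counter n times, return a copy. */
static void *worker(void *arg) {
    int n = *(int *)arg;
    for (int i = 0; i < n; i++) ++*get_counter();
    int *result = malloc(sizeof *result);
    if (result == NULL) abort();
    *result = *get_counter();
    return result;
}

/* Run one worker thread and return the counter value it ended with. */
static int run_worker(int n) {
    pthread_t t;
    void *ret = NULL;
    if (pthread_create(&t, NULL, worker, &n) != 0) abort();
    if (pthread_join(t, &ret) != 0) abort();
    int value = *(int *)ret;
    free(ret);
    return value;
}
```

Each thread sees its own counter, so run_worker(5) should come back with 5 no matter what other threads did. Note that every access goes through a pthread_getspecific call, and pthread_key_delete is left out here because the key lives for the whole process.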
Ah, but you can "inline" it by hardcoding its implementation details into your code in the most horrible way imaginable, as a crafty reader suggested. Scarily enough, that seems to be the fastest and most robust TLS access method from a shared library: https://yosefk.com/blog/cxx-thread-local-storage-performance...