One of my constant pain points with all distributed data stores is that it's really hard to find out how they behave when something breaks, be it the network, local storage, or something else. How do I find out what's wrong? Are there guides on how to fix a problem? What happens if I lose more nodes than the system can automatically recover from? How do backup and restore work? Are there any estimates of how long a restore will take? Are there non-obvious failure modes that I should monitor for? This is mostly the operations side, but personally I would never use something I don't understand well enough to have a good feeling for how the system works beneath the shiny surface.

And of course there's the application side: with SQL and EXPLAIN, I can usually see bottlenecks. With distributed systems, I have a latent fear that performance suddenly tanks when, for example, some structure ends up split across nodes.