It's worth pointing out that AWS Elasticsearch is not simply the Open Distro - a lot of what you see in the Open Distro is not currently available on AWS Elasticsearch. Beyond that, many features are forcibly disabled in Amazon's offering, and many cluster settings and APIs are untouchable in the AWS service (even read-only ones that would be super helpful).
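If you want to see this on your own domain, the simplest check is to hit a few read-only endpoints and look at what comes back. A rough sketch - the endpoint URL is made up, and the list is just the calls I personally find myself wanting:

```python
# Probe a few read-only Elasticsearch APIs and report the status codes,
# to see which ones the managed service actually exposes.
import requests

ES_URL = "https://my-domain.example.com"  # hypothetical endpoint

endpoints = [
    "_cluster/settings?include_defaults=true",
    "_cluster/allocation/explain",
    "_nodes/stats/fs",
    "_cat/allocation?v",
]

for path in endpoints:
    resp = requests.get(f"{ES_URL}/{path}", timeout=10)
    print(f"{resp.status_code}  {path}")
```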
I still can't touch any of the rebalancing settings on my clusters; they all appear to be forcibly disabled. If rebalancing worked as expected, the whole blue-green thing wouldn't be necessary, and over time I wouldn't end up with a single full data node while every other node in the cluster has 300GB free. Am I missing something?
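For context, this is roughly the settings call I'd make on a self-managed cluster to keep shard allocation sensitive to disk usage - a sketch with a hypothetical endpoint URL and example thresholds, and exactly the kind of request the AWS service rejects:

```python
# Tune rebalancing and disk watermarks via the cluster settings API
# on a self-managed Elasticsearch cluster (not possible on AWS ES).
import requests

ES_URL = "https://my-domain.example.com"  # hypothetical endpoint

settings = {
    "transient": {
        # allow shards to rebalance across all nodes
        "cluster.routing.rebalance.enable": "all",
        # start relocating shards off a node well before it fills up
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}

resp = requests.put(f"{ES_URL}/_cluster/settings", json=settings, timeout=10)
print(resp.status_code, resp.text)
```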
None of the CloudWatch alarms you linked to have much relevance to the issues in the article (other than the ClusterIndexWritesBlocked alarm, which only starts firing after everything has already broken). As of the last time I looked, you cannot monitor disk space on individual nodes in CloudWatch, only on the cluster as a whole. Alerting on a single node starting to fill up is basically the one alert that would tell me things are about to be in a bad state.
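The workaround I'd want is to poll the cluster itself rather than CloudWatch - something like this sketch, with a hypothetical endpoint and an arbitrary 85% threshold, using the _cat/allocation API to flag any single node that's filling up:

```python
# Per-node disk check: poll _cat/allocation and flag any one node that
# is getting full, since CloudWatch only exposes cluster-wide free space.
import requests

ES_URL = "https://my-domain.example.com"  # hypothetical endpoint
THRESHOLD = 85  # alert when any single node passes this disk.percent

resp = requests.get(f"{ES_URL}/_cat/allocation?format=json&bytes=gb", timeout=10)
resp.raise_for_status()

for row in resp.json():
    pct = row.get("disk.percent")  # None for the UNASSIGNED row
    if pct is not None and int(pct) >= THRESHOLD:
        print(f"node {row['node']} is at {pct}% disk "
              f"({row['disk.avail']}gb free) - this is the alert I actually want")
```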
Their service seems to work well for small implementations that use EBS-backed storage, and I bet that's what most of their customers are using, but I'm running 60+ node clusters and the problems only seem to get worse as capacity goes up.
Someone in the comments here mentioned they destroy and rebuild their cluster weekly just to keep the shards balanced. How ridiculous is it for that to be the best option offered?