
Some valid points and some relevant real-world AWS support nightmare scenarios, though I think there's a chance the author might be wrong about a few things, or maybe I misunderstood them. My 2 cents:

> Amazon’s implementation is missing a lot of things like RBAC and auditing.

Open Distro (which AWS uses for its Elasticsearch deployments) supports this: https://opendistro.github.io/for-elasticsearch/features/secu...

> Shard rebalancing, a central concept to Elasticsearch working as well as it does, does not work on AWS’s implementation.

Not sure why the author says AWS doesn't support it, but I have seen it rebalance shards just like vanilla Elasticsearch would. In fact, the only time it won't rebalance is when the shard allocator can't find a suitable home for the unassigned shards (and that's vanilla behaviour, iirc): https://aws.amazon.com/blogs/opensource/open-distro-elastics...
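For example, a quick sketch of how you'd check the effective rebalance settings and ask the allocator why a shard is stuck (the endpoint is a placeholder and auth is omitted; adjust for your own domain):

    # Sketch: inspect rebalance-related settings and explain unassigned shards.
    import requests

    ES = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint

    # Effective cluster settings, defaults included
    settings = requests.get(
        f"{ES}/_cluster/settings",
        params={"include_defaults": "true", "flat_settings": "true"},
    ).json()
    merged = {**settings.get("defaults", {}),
              **settings.get("persistent", {}),
              **settings.get("transient", {})}
    for key, value in merged.items():
        if "rebalance" in key or "allocation" in key:
            print(key, "=", value)

    # If there is an unassigned shard, this explains why the allocator
    # can't place it (the vanilla behaviour mentioned above); it returns
    # an error when nothing is unassigned.
    explain = requests.get(f"{ES}/_cluster/allocation/explain").json()
    print(explain.get("allocate_explanation") or explain)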

> ...if a single node in your Elasticsearch cluster runs out of space, the entire cluster stops ingesting data, full stop. Amazon’s solution to this is to have users go through a nightmare process of periodically changing the shard counts in their index templates and then reindexing their existing data into new indices, deleting the previous indices, and then reindexing the data again to the previous index name if necessary.

I think the author should set up alerts for cluster health https://docs.aws.amazon.com/elasticsearch-service/latest/dev... or write their own https://github.com/opendistro-for-elasticsearch/alerting and definitely read about best practices for running petabyte-scale clusters on AWS (I am sure they've read about it already, given they're in touch with SMEs and TAMs): https://aws.amazon.com/blogs/database/run-a-petabyte-scale-c...
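As a minimal sketch of the kind of alarm those docs describe, via boto3 (domain name, account id, threshold, and SNS topic are placeholders):

    # Alarm when the node with the least free storage drops below a threshold.
    import boto3

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="es-free-storage-low",
        Namespace="AWS/ES",
        MetricName="FreeStorageSpace",   # reported in MB
        Dimensions=[
            {"Name": "DomainName", "Value": "my-domain"},   # placeholder
            {"Name": "ClientId", "Value": "123456789012"},  # placeholder account id
        ],
        Statistic="Minimum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=20000,  # placeholder: ~20 GB free
        ComparisonOperator="LessThanOrEqualToThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:es-alerts"],  # placeholder
    )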

> Hope you had a backup of what you needed to dump.

Amazingly, AWS Elasticsearch does automated hourly backups and retains them for 14 days, for free: https://aws.amazon.com/about-aws/whats-new/2019/07/amazon-el...
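A sketch of poking at those snapshots, assuming the usual cs-automated repository name (cs-automated-enc on encrypted domains); the endpoint, snapshot name, and index name are placeholders:

    # List the automated snapshots and restore one index from a snapshot.
    import requests

    ES = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint

    snapshots = requests.get(f"{ES}/_snapshot/cs-automated/_all").json()
    for snap in snapshots.get("snapshots", []):
        print(snap["snapshot"], snap["state"])

    # Restore a single index (the target index must be closed or absent first)
    requests.post(
        f"{ES}/_snapshot/cs-automated/<snapshot-name>/_restore",  # placeholder snapshot
        json={"indices": "my-index", "include_global_state": False},
    )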

> The second option is to add more nodes to the cluster or resize the existing ones to larger instance types.

AWS Elasticsearch can't yet scale out (change the instance count) without resorting to a blue-green deployment. They should have implemented that by now, like they did for policy updates: https://aws.amazon.com/about-aws/whats-new/2018/03/amazon-el... I hope fixing this is on their roadmap.
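For reference, this is roughly what that change looks like through boto3 today; the domain name, instance type, and count are placeholders, and AWS still does the blue-green swap behind the scenes:

    # Resize the cluster config and watch the domain's Processing flag.
    import boto3

    es = boto3.client("es")

    es.update_elasticsearch_domain_config(
        DomainName="my-domain",                        # placeholder
        ElasticsearchClusterConfig={
            "InstanceType": "r5.xlarge.elasticsearch",  # placeholder
            "InstanceCount": 12,                        # placeholder new node count
        },
    )

    # Processing stays True while the new fleet is stood up and data migrates
    status = es.describe_elasticsearch_domain(DomainName="my-domain")
    print(status["DomainStatus"]["Processing"])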

----

Also, I believe the real problem with the managed Elasticsearch offering is that end-users still have to worry about the servers; it isn't truly hands-off the way Lambda or DynamoDB are. This is complicated by the fact that Elasticsearch exposes innumerable ways to configure cluster and index setups (read: shoot yourself in the foot).

I guess AWS Elasticsearch needs something like Aurora Serverless's Data API, as the current offering takes too much control away from power users (you can't SSH into the nodes to fix anything at all, and the constant reliance on the oft-incompetent AWS support to do the firefighting, while you frustratingly wait on the fringes with little to no transparency, is a big red flag): https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

With the Performance Analyzer and automatic index management included in Open Distro, they might already be halfway there: https://news.ycombinator.com/item?id=19361847




Couple things -

It's worth pointing out that AWS Elasticsearch is not simply the Open Distro - a lot of things you see in the Open Distro are not currently available on AWS Elasticsearch. Beyond that, many features are forcibly disabled in Amazon's offering, and many cluster settings and APIs are untouchable in the AWS service (even read-only ones that would be super helpful).

I still can't touch any of the rebalancing settings on my clusters, and everything looks forcibly disabled. If rebalancing worked as expected, the whole blue-green thing wouldn't be necessary, and over time I wouldn't end up with a single full data node while every other node in the cluster has 300GB free. Am I missing something?

None of the CloudWatch alarms you linked to have much relevance to the issues in the article (other than the ClusterIndexWritesBlocked alarm, which only starts firing after everything has already broken). As of the last time I looked, you cannot monitor disk space on individual nodes in CloudWatch, only on the cluster as a whole. An alert on a single node starting to fill up is basically the one alert that would tell me things are about to go bad.
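(A possible hand-rolled workaround, not something the service gives you: poll _cat/allocation yourself and publish per-node free space as a custom CloudWatch metric you can alarm on. The endpoint, namespace, and domain name below are placeholders.)

    # Publish per-node free disk space as a custom CloudWatch metric.
    import boto3
    import requests

    ES = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
    cloudwatch = boto3.client("cloudwatch")

    nodes = requests.get(
        f"{ES}/_cat/allocation", params={"format": "json", "bytes": "gb"}
    ).json()

    for node in nodes:
        if node.get("node") == "UNASSIGNED":  # skip the unassigned-shards row
            continue
        cloudwatch.put_metric_data(
            Namespace="Custom/Elasticsearch",  # placeholder namespace
            MetricData=[{
                "MetricName": "DiskAvailableGB",
                "Dimensions": [
                    {"Name": "DomainName", "Value": "my-domain"},  # placeholder
                    {"Name": "Node", "Value": node["node"]},
                ],
                "Value": float(node["disk.avail"]),
                "Unit": "Gigabytes",
            }],
        )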

Their service seems to work well for small implementations that use EBS-backed storage, and I bet that's what most of their customers are running, but I'm running 60+ node clusters and the problems only seem to get worse as capacity goes up.

Someone in the comments here mentioned they destroy and rebuild their cluster weekly just to keep the shards balanced. How ridiculous is it for that to be the best option offered?



