What commodity tool should I be replacing it with that gives me:
* support for heterogeneous workloads (web servers, big batch loads, streaming workloads)
* autoscaling the workloads
* autoscaling the cluster
* auto-restarting workloads
* basically automatic networking
* the ability to have all that without the need to set up separate infrastructure, and to have the orchestrator move workloads around to optimise machine use
* the ability to hire people with experience in it already
* ability to run stateful workloads, and have disks managed and moved for you
* the ability to do all that by just providing a config
* the ability to do all that without having to let people have actual access to the machines/OS
* per container/workload topology constraints and requirements, that are network and data centre aware, without having to write any code
* have access to a wide range of tooling and docs
* be able to drop in workloads that automatically handle log collection, etc. without needing to configure each deployment.
* ability to swap out Network ingress/reverse proxy without having to redeploy or reconfigure your applications.
* ability to drop in any number of tools to do traffic shadowing, mirroring, splits, etc without reconfiguring or redeploying your applications
* ability to do all this without your devs needing to be networking experts.
* not locked to a specific cloud provider
I’m sure there’s tools out there that cover different subsets of those features, there’s probably tools that do all of them and more. K8s isn’t perfect, but it gives you an awful lot for some reasonable tradeoffs. Sure, if you’re deploying a single web-server/app, you shouldn’t bother with all this; and data centre level stuff probably has tools that make these features look like a cute picnic, but many devs I’ve worked with haven’t the capacity, ability or desire to figure that out.
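The "do all that by just providing a config" point can be made concrete. As a hedged sketch (names like `my-app` and the image are placeholders, not from this thread), a Deployment plus a HorizontalPodAutoscaler covers auto-restarting and workload autoscaling from the list above:

```yaml
# Minimal sketch: a Deployment K8s will keep running/restart,
# plus an HPA that scales it on CPU utilisation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                          # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels: { app: my-app }
  template:
    metadata:
      labels: { app: my-app }
    spec:
      containers:
        - name: web
          image: registry.example.com/my-app:1.0   # placeholder image
          ports: [{ containerPort: 8080 }]
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
```

Cluster autoscaling, ingress, and topology constraints are each similarly a few lines of config on top of this, rather than separate infrastructure.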
> If you’re on a cloud provider? You’ve already got secure, restartable, restorable, on-demand-growable, discoverable, fail-over-able, imaginary machines a zillion times more robust
Sure, but you’ve still got to figure out how you want to setup, monitor, configure all of that, then stitch it together and maintain it, and also not get in the way of your main application development. To me, that’s what K8s gives me: I rely on the cloud provider to give me good machines, good networking, good storage primitives, and to heal/replace them when they fail, and K8s gives me a way to use all of that nicely, without needing to build and maintain it myself, and without my applications needing to know everything.
> How did people ever manage to do anything before k8s?
Nobody ever said it wasn’t possible. Nobody is saying “you can’t run things the way you want”, the argument is “you keep saying K8s is bad, here’s some reasons why it’s maybe not”.
> I'm going to make the claim that a load balancer + asg + basic networking setup is all you need in 95% of cases.
If you can make it support the autoscaling Spark jobs, the jupyter-hub env the analysts use, and all our gRPC APIs, and have everything mTLS'd together, and have all of the DNS + cert + routing magic, with similar setup and maintenance effort, I'll convert.
> Learning how to package services and have them run anywhere? A lost art.
The issue isn’t my team’s code; if we were only running our own stuff, and we only used a single language, packaging would likely be straightforward. But it’s the fact that we run several other applications, of varying degrees of “packaged”, and the fact that I can barely get the Python devs to use a virtualenv properly, that makes this statement a bit too unreasonable in my experience. Containers may be messy, but at least I can run 3 different Spark pipelines and 4 Python applications without needing to dig into the specifics of how each particular app packages its dependencies, because that’s always an awful experience.
> If you can make it support the autoscaling Spark jobs, the jupyter-hub env the analysts use, and all our gRPC APIs, and have everything mTLS'd together, and have all of the DNS + cert + routing magic, with similar setup and maintenance effort, I'll convert.
Let's define what you're talking about here and I can show you the way.
What is the difficulty in autoscaling Spark jobs?
What jupyter env do they use?
You say gRPC API. That's generic; what do those services really do? Sync request/response backed by a DB? Async processing? What infra do they need to "work"?
Where are you running your workloads? Cloud? Which one?
> What is the difficulty in autoscaling Spark jobs?
I mean, running spark is an awful experience at the best of times, but let’s just go from there.
Spark drivers pull messages off Kafka, and scale executors up or down dynamically based on how much load/messages they have coming through. This means you need stable host names, ideally without manual intervention. The drivers and the executors should also use workload-specific roles - we use IRSA for this, and it works quite nicely. Multiple Spark clusters will need to run, be oblivious to each other, and shouldn’t “tread on each other”, so the provisioning topology should avoid scheduling containers from the same job (or competing jobs) onto the same node. Similarly, a given cluster should ideally sit within a single AZ to minimise latency - it doesn’t matter which one, but it shouldn’t be hardcoded, because the scheduler (i.e. K8s) should be able to choose based on available compute. Some of the jobs load models, so they need an EBS volume attached as scratch space.
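The scheduling rules above map fairly directly onto standard K8s affinity config. A hedged sketch of an executor pod template (labels, image, and storage class are illustrative assumptions, not from my actual setup):

```yaml
# Sketch: anti-affinity keeps pods of the same job off one node;
# zone affinity pulls a job's pods into whichever AZ the scheduler
# places the first one in; an ephemeral EBS-backed volume is scratch space.
apiVersion: v1
kind: Pod
metadata:
  labels:
    spark-job: job-a                  # placeholder job label
spec:
  affinity:
    podAntiAffinity:                  # same job never shares a node
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels: { spark-job: job-a }
          topologyKey: kubernetes.io/hostname
    podAffinity:                      # prefer one AZ, but don't hardcode which
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels: { spark-job: job-a }
            topologyKey: topology.kubernetes.io/zone
  containers:
    - name: executor
      image: registry.example.com/spark:3.5   # placeholder image
      volumeMounts:
        - { name: scratch, mountPath: /scratch }
  volumes:
    - name: scratch
      ephemeral:                      # EBS via the CSI driver, created/destroyed
        volumeClaimTemplate:          # with the pod
          spec:
            accessModes: ["ReadWriteOnce"]
            storageClassName: gp3     # assumed EBS-backed storage class
            resources: { requests: { storage: 50Gi } }
```

Stable host names for the drivers come from running them under a StatefulSet (or a headless Service) rather than a bare Deployment.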
> What jupyter env do they use?
We use jupyterhub wired up to our identity provider. So an analyst logs on, a pod is scheduled somewhere in the cluster with available compute, and their unique EBS volume is automounted. If they go to lunch, or idle, state is saved, the pod is scaled down, and the EBS volume is auto-detached.
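For a sense of how little config that takes, here's a hedged sketch of Zero to JupyterHub Helm values approximating that setup (the OAuth client, hostname, storage class, and timeout are illustrative assumptions):

```yaml
# Sketch: OIDC login, one dynamically-provisioned PVC per user
# (attached on login, detached when the pod is culled), idle culling.
hub:
  config:
    GenericOAuthenticator:            # wire the Hub to an identity provider
      client_id: jupyterhub           # placeholder client
      oauth_callback_url: https://hub.example.com/hub/oauth_callback
singleuser:
  storage:
    type: dynamic                     # per-user PVC, EBS-backed
    dynamic:
      storageClass: gp3               # assumed storage class
    capacity: 20Gi
cull:
  enabled: true                       # stop idle servers; state survives on
  timeout: 3600                       # the user's volume
```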
> That's genericm what do those services really do? Sync request/response backed by a db? Async processing? What infra do they need to "work"
The API stuff is by far the easiest. Request/response, backed by DBs, plus the odd analytics and monitoring tool. The servers themselves autoscale horizontally, and some of them use other services hosted within the same cluster. All east-west traffic within the cluster is secured with mTLS via Linkerd, and between that and our metric collection setup we get automatic collection of metrics, latencies, etc. - like what you get with the AWS dashboards, but in more detail (memory usage, for one) - plus automatic log collection.
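The "automatic mTLS" part is worth spelling out, because it really is just an annotation. A hedged sketch (the namespace name is a placeholder; the annotation is Linkerd's real injection switch):

```yaml
# Sketch: with Linkerd installed, annotating a namespace injects the
# sidecar proxy into every pod scheduled there, which transparently
# provides mTLS between meshed services plus per-route latency metrics.
apiVersion: v1
kind: Namespace
metadata:
  name: apis                          # placeholder namespace
  annotations:
    linkerd.io/inject: enabled
```

The applications themselves speak plain HTTP/gRPC; none of them know the mesh exists, which is what lets you swap or reconfigure it without redeploying them.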
> Where are you running your workloads?
AWS, but with minimal coupling - S3 is basically a commodity API nowadays, and the only tight integration is IRSA, which I believe GCP has a very similar version of (Workload Identity). So most of this should work in any of the other clouds, with minimal alteration.
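The IRSA coupling is itself one annotation. A hedged sketch (the ServiceAccount name and role ARN are placeholders):

```yaml
# Sketch: a ServiceAccount annotated with an IAM role ARN. Pods that use
# this ServiceAccount get workload-specific AWS credentials via EKS's
# OIDC federation, with no shared node-level credentials.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark-driver                  # placeholder
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/spark-driver  # placeholder ARN
```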
I find it amazing how people take the absolute shittiest practice from the past and compare it to k8s and reach the conclusion k8s is better.
Have you ever done IaC where the whole copy of the infra came up on deployment and traffic was shifted between old and new? And developers could spin up their own stack with one Cloudformation or Terraform command? Have you used cloud primitives that solve most of the problems you are reinventing with k8s?
Yep, that's the typical straw-man from K8s zealots: compare static, even bare-metal setups with the K8s API "revolution", forgetting that the "clouds" (both public and private) already had APIs that covered most of it.
Sure, there are some nice additions on top of that, like "operators" that take advantage of the main loop you already have perpetually running in a k8s cluster, but the leap is not as astronomical as some say.
If you think those APIs don't need tons of scripts to be useful to a company, then either you do everything through the AWS console, or you are an anti-k8s zealot yourself.
There are many established solutions on top of those APIs that let you manage them with code, DSLs, graphical UIs etc. (e.g. Terraform, Harness, Pulumi, Spinnaker).
I have done everything from deploying to Tomcat servers to modern k8s. I really don't like CloudFormation and ECS: arcane, poorly documented syntax. I much prefer having a Helm chart for my apps. Have you ever tried to set up private DNS in ECS for services to talk to each other? That doesn't look simple (or well documented) at all to me.
By the way, Terraform and k8s are not exclusive, I use them together.
I'm right now migrating from "anything before k8s" to k8s.
I can tell you how we managed to do anything.
We had a few servers. They ran docker-compose files scattered all over people's home directories.
We had one nginx entrypoint. It was forbidden to restart it during work hours, because it wouldn't come back up and there'd be an hours-long outage. You were expected to restart it on a weekend and then spend hours trying to start it again.
Some docker images were built locally.
Backups basically didn't exist before I came to this company. They were the first thing I fixed. Now we can restore our system if a server dies. It'll take a week or two.
Kubernetes is going to be a sane exit.
Yes, I know that it's possible to build a sane environment. The problem is, people don't bother to do that. They build it in the worst possible way. And with Kubernetes, the worst possible way is much better, because Kubernetes provides plenty of quality building blocks which you don't need to reinvent. You don't need to reinvent an nginx ingress with all the necessary scripts to regenerate its config, because you've got ingress-nginx. You don't need to invent a certbot integration, because you've got cert-manager (which is my personal favourite piece of k8s). You don't need to invent your own docker-compose storage conventions or user permissions (777 everywhere, right?), because k8s has you covered. All you need is to learn its conventions and follow them.
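To illustrate the ingress-nginx + cert-manager point: one resource gets you routing plus an auto-issued, auto-renewed TLS cert. A hedged sketch (the host, Service name, and issuer name are placeholders you'd need to have set up):

```yaml
# Sketch: an Ingress handled by ingress-nginx; the cert-manager annotation
# triggers issuance of a certificate into the named Secret, replacing a
# hand-rolled certbot + config-regeneration setup.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # assumed ClusterIssuer name
spec:
  ingressClassName: nginx
  tls:
    - hosts: [app.example.com]                    # placeholder host
      secretName: app-example-com-tls             # cert-manager populates this
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app                      # placeholder Service
                port: { number: 80 }
```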