I’ve only read about Service Mesh; my impression was that it seems to add an awful lot of processes and complexity just to make developers’ lives slightly easier.

Maybe I’m wrong but it almost feels like busy work for DevOps. Is my first impression wrong? Is this the right way to architect systems in some use cases, and if so what are they?



>slightly easier

As a company grows, sooner or later most of these features become pretty desirable from an operations perspective. Feature developers likely don't and shouldn't need to care. It probably starts with things like auth and basic load balancing. As the company grows to dozens of teams and services, you'll start feeling pain around service discovery and wishing you didn't need to implement yet another custom auth scheme to integrate with another department's service.

After a few retry storm outages people will start paying more attention to load shedding, autoscaling, circuit breakers, rate limiting.
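
To make "load shedding" and "rate limiting" concrete, here's a minimal sketch in Go using the golang.org/x/time/rate package; the limits and names are illustrative, not a recommendation:

    import (
        "net/http"

        "golang.org/x/time/rate"
    )

    // Allow ~100 requests/sec with bursts of 20; everything beyond that
    // is rejected immediately with a 429 instead of queued, so overload
    // can't snowball into a retry storm.
    var limiter = rate.NewLimiter(100, 20)

    func withLoadShedding(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            if !limiter.Allow() {
                http.Error(w, "overloaded, retry later", http.StatusTooManyRequests)
                return
            }
            next.ServeHTTP(w, r)
        })
    }

A service mesh gives you this (and circuit breaking) as sidecar config rather than code repeated in every service, which is much of the appeal.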

More mature companies or ones with compliance obligations start thinking about zero-trust, TLS everywhere, auditing, and centralized telemetry.

Is there complexity? Absolutely. Is it worth it? That depends where your company is in its lifecycle. Sometimes yes, other times you're probably better off just building things and living with the fact that your load shedding strategy is "just tip over".


We’re in the process of moving all of our services over to a service mesh, and while the growing pains are definitely there, the payoff is huge.

Even aside from a lot of the more hyped-up features of service meshes, the biggest thing Istio solves is TLS everywhere and cloud-agnostic workload identity. All of our pods get new TLS certs every 24 hours and nobody needs an API key to call anything.

Our security team is thrilled that applications running with an Istio sidecar literally have no way to leak credentials. There are no API keys to accidentally log. Once we have databases set up to support mTLS authentication, we won’t need database passwords anymore.
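
To make "workload identity" concrete: in an Istio-style setup, each workload's certificate carries a SPIFFE ID as a URI SAN, which the receiving side can read off the connection. A hypothetical Go sketch (not Istio's actual API; the ID format shown is the usual cluster.local convention):

    import (
        "crypto/tls"
        "errors"
    )

    // peerSPIFFEID extracts the workload identity (e.g.
    // spiffe://cluster.local/ns/payments/sa/billing) from the peer's
    // leaf certificate on an established mTLS connection.
    func peerSPIFFEID(cs tls.ConnectionState) (string, error) {
        if len(cs.PeerCertificates) == 0 {
            return "", errors.New("no peer certificate presented")
        }
        for _, uri := range cs.PeerCertificates[0].URIs {
            if uri.Scheme == "spiffe" {
                return uri.String(), nil
            }
        }
        return "", errors.New("peer certificate has no SPIFFE ID")
    }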


Some of the functionality you mentioned above is possible without a service mesh.


All of the functionality of Kubernetes can be implemented independently too. It’s still a useful set of abstractions, not least because it’s understood by a large portion of the industry.


It’s 100% a question of scale. And I don’t mean throughput, I mean domain and business logic complexity that requires an army of engineers.

Just as it’s foolish to create dozens of services if you have a 10-person team, you don’t really get much out of a service mesh if you only have a handful of services and aren’t feeling the pain with your traditional tooling.

But once you get to large scale with convoluted business logic that is hard to reason about because so many teams are involved, the search for scalable abstractions begins. A service mesh then becomes useful because it is completely orthogonal to business logic: you can now add engineers 100% focused on tooling and operations, and product engineers can think a lot less about certain classes of reliability and security concerns.

Of course, in today’s era of resume-driven development and the huge comp paid by FAANGs, you are going to get a ton of young devs pushing for a service mesh way before it makes sense. I can’t say I blame them, but keep your wits about you!


If you can convince your business folks to run shit on the command line then there is basically no need for services ever. I know it sounds insane, but it’s how it was done in the old days, and there really is only a false barrier to doing it again.


A place I worked had support staff copy-pasting mongo queries from Google Docs -- it worked in the early days, but eventually you have to start building an admin interface for more complicated processes.

When it was just mongo, setup was easy since they only needed a mongo desktop client.


Terminal can handle auth.


Many of the use cases described in the post are solved by service meshes.

So, in my opinion, the questions are introspective:

- “Do I have enough context to know what problem those solutions are solving, and to at least appreciate the problem space to understand why someone may solve it like this?”

- “Do I have, or perceive, those problems impacting my infrastructure/applications?”

- “Does the solution offered by the use cases described appeal to me?”

If yes at the end, then one potential implementation is a service mesh.

A lot of these are solved out-of-the-box with Hashicorp’s Nomad/Consul/Vault pairing, for example!


It is true that a lot of those use cases are covered by "basic" Kubernetes (or Nomad) without the addition of Istio or similar, e.g. service discovery, load-balancing, circuit-breaking, autoscaling, blue-green, isolation, health checking...

Adding a service mesh onto Kubernetes seems to bring a lot of complexity for a few benefits (80% of the effort for the last 20% sort of deal).


> Adding a service mesh onto Kubernetes seems to bring a lot of complexity for a few benefits

I think the benefits are magnified in larger organizations or where operators and devs are not the same people. And the complexity is relative to which solution you pick. If you're already on Kubernetes, linkerd2 is relatively easy to install and manage; is that worth it? To me it has been in the past.


I like how you frame the questions. How many times do people pick a technology without answering them, even when they have some knowledge of it?

I am wondering whether Nomad/Consul continues to scale past a certain level?


I don't know about Consul, but Nomad has been scaled to 2,000,000 containers on >6000 hosts

https://www.hashicorp.com/c2m


It's a "big company" thing. In my opinion, the best way to add mTLS to your stack is to just adjust your application code to verify the certificate on the other end of the connection. But if the "dev team" has the mandate "add features X, Y, and Z", and the "devops team" has the mandate "implement mTLS by the end of Q1", you can see why "bolt on a bunch of sidecars" becomes the selected solution. The two teams don't have to talk with each other, but they both accomplish their goals. The cost is less understanding, debuggability, and the cost of the service mesh product. But, from both teams' perspective, it looks like the best option.

I'm not a big fan of this approach; the two teams need to have a meeting and need to have a shared goal to implement the business's selected security requirements together. But sometimes fixing the org is too hard, so there is a Plan B.
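
For reference, the "just verify the certificate in application code" version is only a few lines in Go, assuming the CA bundle and key pair are already on disk (paths below are made up); the hard part is getting those files distributed and rotated:

    import (
        "crypto/tls"
        "crypto/x509"
        "net/http"
        "os"
    )

    func newMTLSServer(handler http.Handler) (*http.Server, error) {
        caPEM, err := os.ReadFile("/etc/certs/ca.pem") // hypothetical path
        if err != nil {
            return nil, err
        }
        pool := x509.NewCertPool()
        pool.AppendCertsFromPEM(caPEM)

        cert, err := tls.LoadX509KeyPair("/etc/certs/tls.crt", "/etc/certs/tls.key")
        if err != nil {
            return nil, err
        }
        return &http.Server{
            Addr:    ":8443",
            Handler: handler,
            TLSConfig: &tls.Config{
                Certificates: []tls.Certificate{cert},
                ClientCAs:    pool,
                // Reject any client whose cert doesn't chain to our CA.
                ClientAuth: tls.RequireAndVerifyClientCert,
            },
        }, nil
    }

    // Start with srv.ListenAndServeTLS("", ""); cert and key come from TLSConfig.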


I very much disagree with the sentiment that adding mTLS is just “verifying the certificate on the other end of the connection”. You’re ignoring the process of distributing and rotating certificates, which is non-trivial to implement application-side.


I honestly thought about covering a few ideas in the post, but decided it was off topic. The service meshes do include some rudimentary key generation and distribution code, which is nice to not have to build yourself. The simplest thing, if you're deployed in k8s or similar, is cert-manager + a CA + code that reloads keys when the secret is updated (pretty easy to write). This has downsides (good luck when your CA expires!) but it is easy and does keep itself functional. Cloud providers also have a service like this, which protects the root key with their own IAM (and presumably dedicated hardware); it's definitely a route you'll want to look into.
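
The reload code really can be short in Go, because tls.Config.GetCertificate is consulted on every handshake. A sketch, assuming the cert-manager secret is mounted at a path like the one below and something (e.g. an fsnotify watcher) calls reload() when the files change:

    import (
        "crypto/tls"
        "sync"
    )

    var (
        mu   sync.RWMutex
        cert *tls.Certificate
    )

    // reload re-reads the key pair from the mounted secret volume
    // (hypothetical path); call it whenever the files change.
    func reload() error {
        c, err := tls.LoadX509KeyPair("/var/run/tls/tls.crt", "/var/run/tls/tls.key")
        if err != nil {
            return err
        }
        mu.Lock()
        cert = &c
        mu.Unlock()
        return nil
    }

    // New handshakes pick up the rotated cert with no restart.
    var cfg = &tls.Config{
        GetCertificate: func(*tls.ClientHelloInfo) (*tls.Certificate, error) {
            mu.RLock()
            defer mu.RUnlock()
            return cert, nil
        },
    }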

What's missing are a bunch of things you probably want to check before issuing keys; was the release approved, was all the code reviewed before release, is the code reading the foo-service key actually foo-service? That involves some input from your orchestration layer; i.e. an admission controller that checks all these things against your policies, and only then injects a key that the application can read. (Picking up rotated keys becomes more difficult, but this might be a good thing. "If you don't re-deploy your code for 90 days, it stops being able to talk to other services" doesn't seem like the worst policy I can think of in a world where Dependabot opens up 8 PRs a day against your project.)

This all has the downside that it doesn't really prevent untrusted applications from ruining the security; a dump_keys endpoint that prints the secret key to a log, nefarious code checked into source control but approved (perhaps due to a compromised developer workstation), etc. Fixing those problems is well outside the scope of a service mesh, but something you have to have a plan for. CircleCI didn't! Now you read 3 blog posts a day about how they got hacked.

Anyway, not sure where I was going with this, but application teams need to consider their threat model and protect against it. Security isn't a checkbox that can be checked by someone that didn't write the code. Sure, you can get all sorts of certifications this way that look nice on your marketing page, but the certifications really only cover "did they do the bare minimum to look kind of competent if it was 10 years ago". If you have sophisticated adversaries, you're going to need a sophisticated security team.


Can’t each service just have a job that calls the Let’s Encrypt API once a day to get a new cert?


Most of my programming peers want to focus on solving product-related problems rather than authn, authz, TLS config, failover, throttling, discovery…

We want to automate everything not related to the code we want to write. Service meshes sound like a good way to do that.


Right - but why not use something like an API gateway then?


API gateways are primarily used for HTTP traffic coming from clients external to your backend services, e.g. an iOS device (hence the term 'gateway' vs. 'mesh'). I don't think they support Thrift or gRPC (at least AWS's doesn't; not sure about other providers). https://aws.amazon.com/api-gateway/


Google Cloud supports gRPC on their API gateway: https://cloud.google.com/api-gateway/docs/grpc-overview


That can work, but it means you’ve simply outsourced the problem to AWS. It’s not a bad idea per se, but it means your service needs to speak HTTP in some way.

You could use AWS’s service mesh offering, along with Cognito JWTs, for authentication and authorization.


You can easily self host your own proxy. I bet API gateway is just Nginx, Traefik or HAProxy under the hood anyway.
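
For what it's worth, a self-hosted gateway can start out as little more than the Go standard library's reverse proxy (the upstream address is made up); auth, rate limiting, and the rest get layered on as middleware:

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        upstream, err := url.Parse("http://internal-service:9090") // hypothetical upstream
        if err != nil {
            log.Fatal(err)
        }
        proxy := httputil.NewSingleHostReverseProxy(upstream)
        log.Fatal(http.ListenAndServe(":8080", proxy))
    }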


I suspect that if a Service Mesh is ultimately shown to have broad value, one will make its way into the K8S core.

To me, it's a fairly big decision to layer something that's complex in its own right on top of something else that's also complex.


> I suspect that if a Service Mesh is ultimately shown to have broad value, one will make its way into the K8S core

I'm not so sure. I suspect it'll follow the same path as the Gateway API, which it already kind of has with the Service Mesh Interface (https://smi-spec.io/).


Indeed, all the major service mesh solutions for Kubernetes implement (at least part of) the SMI specification. There is a group composed of these players actively working on making the spec a standard.

Understanding these few CRDs gives great insight into what to expect from a service mesh and how things are typically articulated.



