Awesome blog post! I very much enjoy hearing how large web properties implement these technologies and any issues they experience along the way.
Are you using Envoy at all in your main HTTP ingress path? You mentioned HAProxy and AWS ELBs, but it wasn't clear if Envoy is also being considered for public ingress traffic.
We have not yet put Envoy in our main HTTP ingress path, but internally we have designs and implementation paths ready to go, and it's definitely being considered for public ingress traffic. As we noted in the last "teaser" section of the post, we'd really like to leverage Envoy's routing functionality to facilitate migrating client-facing APIs in the backend without affecting frontend interfaces.
Our HAProxy layer that routes ingress traffic to the core backend infrastructure has considerable routing logic that can be moved to Envoy and then further extended. We'd love to explore that path in the coming months.
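To sketch the idea (the route paths and cluster names here are made up for illustration, not our actual config): with Envoy's route configuration, a stable public route can be re-pointed at a new backend cluster without the frontend ever noticing.

```yaml
# Fragment of an Envoy (v2 API) route config: the public paths stay
# stable while the backends behind them get migrated or split up.
route_config:
  virtual_hosts:
  - name: public_api
    domains: ["*"]
    routes:
    - match: { prefix: "/api/comments" }
      route: { cluster: comments_service_v2 }   # migrated backend
    - match: { prefix: "/api" }
      route: { cluster: legacy_monolith }       # everything else, unchanged
```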
I look forward to hearing more about your plans for ingress and how the various pieces fit together (CDN, L4/L7 LBs, TLS termination, geo/policy DNS balancing), especially the performance and new features you get from Envoy. I've used HAProxy before, and it was great for simple routing/reverse proxying but not so great at complex/dynamic configuration or cert management.
HAProxy supports quite complex configurations. We've actually found that many of our users are only using its most basic capabilities, so we've been expanding our blog content to help them take advantage of the more advanced configurations that are possible. We've even found that many users are not aware that HAProxy now supports Hitless Reloads [1].
Quite a bit of complex routing and dynamic configuration can be handled with map files [2], and these and many other settings can be updated on the fly through the Runtime API [3].
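For anyone who hasn't used maps before, here's a minimal sketch (the file path and backend names are invented for illustration):

```
# /etc/haproxy/hosts.map -- Host header value -> backend name
example.com      be_example
api.example.com  be_api

# haproxy.cfg -- pick the backend via the map, with a default fallback
frontend fe_main
    bind :80
    use_backend %[req.hdr(host),lower,map(/etc/haproxy/hosts.map,be_default)]
```

And a map entry can be changed at runtime, without a reload, through the Runtime API on the stats socket:

```
echo "set map /etc/haproxy/hosts.map api.example.com be_api_v2" | \
    socat stdio /var/run/haproxy.sock
```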
With that said, we are actively working to make things even better, and we intend to introduce support for updating SSL certificates/keys directly through the Runtime API, as well as a Data Plane API for HAProxy.
We have a new release coming any day now, and it will lay the foundation that allows us to continue providing best-in-class performance while accelerating cutting-edge feature delivery.
Yes! HAProxy is a terrific piece of tech, and has been awesome for our use cases so far. We do quite a bit with it for our main ingress routing and it was basically flawless as our data plane in SmartStack.
I'm really excited about what we're building out for next year and can't wait to share as well. Feel free to reach out on reddit (u/wangofchung) or directly at courtney.wang@reddit.com for more in-depth discussion!
1. Have you considered, or are you considering, the Istio control plane for your Envoy fleet? Why or why not?
2. Did you containerize your applications before using Envoy? The blog post talks about running them on autoscaled EC2 instances, but it's not clear if you're running application binaries on those VMs or serving from containers.
1. We are considering Istio! This is especially true for our Kubernetes environment. We are already planning to deploy Pilot for the first iteration of our control plane in our non-K8s environment, so the other pieces that comprise Istio are a natural place for us to continue exploring.
2. We had not containerized prior to Envoy. For most of our infrastructure, we're still running application binaries provisioned with Puppet on EC2.
We run one proxy per machine, even when there are multiple services running on it. The proxy is just an abstraction over the downstream dependencies: even if there are multiple services per machine, they can all reach downstream services via the same proxy path.
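Roughly like this (ports, names, and addresses are invented for illustration), in the spirit of the SmartStack port-per-dependency model: every service on the host connects to the same local port for a given dependency.

```yaml
# One Envoy per host: any service on the machine reaches the "users"
# dependency through 127.0.0.1:9211, regardless of which service it is.
static_resources:
  listeners:
  - name: egress_users
    address:
      socket_address: { address: 127.0.0.1, port_value: 9211 }
    filter_chains:
    - filters:
      - name: envoy.tcp_proxy
        config: { stat_prefix: users, cluster: users_service }
  clusters:
  - name: users_service
    connect_timeout: 0.25s
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    hosts:
    - socket_address: { address: users.internal.example, port_value: 8080 }
```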
Thanks for all the answers, I really appreciate it! I've got one more question.
In the period when you had parts of your system running Envoy and parts without, did you route the outbound traffic from Envoy-equipped services through their local proxy before reaching an Envoy-less destination? Or did you bypass Envoy in those cases?
We route all outbound traffic from internal services through Envoy, even if the destination isn't running Envoy. We don't have Envoy running as a "front" proxy right now, i.e. our L4 setup isn't Envoy <-> Envoy, it's Envoy -> service directly. An example of this is the DB layer: traffic going to our DBs from services goes through Envoy on the service side, but Envoy isn't running on our DB instances.
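In config terms, the service-side Envoy just has a cluster pointing straight at the database endpoints (the hostname and port below are hypothetical); nothing on the DB side knows Envoy exists:

```yaml
# Cluster fragment on the service-side Envoy: plain TCP to the DB.
clusters:
- name: main_db
  connect_timeout: 1s
  type: STRICT_DNS
  lb_policy: ROUND_ROBIN
  hosts:
  - socket_address: { address: db01.internal.example, port_value: 5432 }
```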
What are the reasons you chose to do it that way? I mean going for a "back" proxy instead of Envoy <-> Envoy (which seems to be the most "advertised" approach) or a "front" proxy. As I understand it, this way you're losing the Envoy features for your most "shallow" services. Or do you also run Envoy on your ingresses?
The "back" proxy was the initial setup with SmartStack, so we went with that for minimal viable first steps. We wanted to make incremental changes, changing as little as possible, for this migration so we could monitor for correctness and performance at every step. The eventual plan is to run Envoy as a front proxy for ingress, and maybe even Envoy <-> Envoy everywhere, where we have Envoy as both a front and back proxy on every service deployment (instance, container, etc.)
Others have mentioned that there are some gotchas with Envoy, and you mention a few migration bumps in the post. Did you encounter other gotchas? And do you have any suggestions on how to avoid or mitigate their impact?
The migration bumps mentioned in the post were the main ones we hit, and as described there, they were resolved _very_ quickly.
The most important thing when making a transition like this is to have as much monitoring and observability as possible independent of the new tech. We were able to quickly identify and respond to the issues we had with Envoy based on existing application and system instrumentation that wasn't provided by Envoy itself, along with the vigilance of our engineering team.
Hey, I'm planning to introduce Envoy into an existing mixed Kubernetes/bare-metal architecture myself, with the same "one service at a time" considerations.
Have you thought about adopting Istio? If so, why didn't you?
We're currently evaluating the pieces that comprise Istio, both within Kubernetes and outside of it in our existing infrastructure.
We didn't adopt it right away because we didn't want to update all of our technology at once, and we felt that a piecewise migration would be both the least disruptive to our infrastructure and the safest. I think of Istio as being like SmartStack in that it's not actually a single complete "thing" so much as a suite of technologies that can be individually evaluated and deployed. It's very easy to fall into the trap of wanting to do everything at once, and we opted to make small, progressive steps for this initiative.
Yeah, I'm thinking of going with just the minimal Istio installation (Pilot only) rather than rolling our own, and wiring it up to our existing Consul service discovery.
The more logic we push out of band into sidecars, the harder application issues become to debug. For example, say an Envoy config change is made centrally and all of a sudden my app breaks because an HTTP header has stopped being set. Before, I could easily write a unit test to catch such a thing; now I'd need to replicate the Envoy config in a test environment, etc.
The sidecar model is different from integrating with a third-party API because it is designed to operate transparently: my integration tests might pass, but when running with the sidecar the traffic can still be mutated.
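To make that concrete, a (hypothetical) centrally-managed route config can add or strip headers entirely outside the application's code and tests, e.g.:

```yaml
# Route-config fragment. None of this appears anywhere in the app's
# codebase; only traffic actually flowing through the sidecar is affected.
virtual_hosts:
- name: backend
  domains: ["*"]
  routes:
  - match: { prefix: "/" }
    route: { cluster: app }
  request_headers_to_add:
  - header: { key: "x-request-source", value: "envoy" }
  request_headers_to_remove: ["x-legacy-auth"]
```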
If you aren't familiar with the "jargon" in this post, then you're probably not in a position to reasonably judge whether it's "overly complex". Most everything in here would be familiar to anyone with experience working on modern high-volume web properties.
Regardless of whether service mesh is overly complex, the industry seems to have entered an era of "Complexity Worship". I was speaking to some engineers at a small startup the other day with only a handful of customers. They have invested significant resources building their own K8s cluster, ensuring it runs multi-cloud, etc. It sounds a lot like premature optimisation.
I have a theory that Complexity Worship is a product of boredom among CS graduates who would rather not spend their time implementing WYSIWYG editors or whatever anymore.
That's insane. I feel like at a small scale you get a product out the door and refine it until you start growing. It's really not hard to use Ansible to deploy your stuff onto EC2 and add more nodes as you grow.
But, K8s is the new hotness and people are going to use it.
The original architecture seemed overly complex IMO. Maintaining a fleet of HAProxies and their configs sounds daunting... it seems like they're gaining both flexibility (configs that are easier to maintain) and observability (request/response metrics for Thrift requests).
Per-service proxy deployments are a bit complex for the infrastructure but provide a nice abstraction for the services and the service developers themselves. The configuration scheme is indeed daunting, which is what we're hoping Envoy and its xDS APIs plus centralized configs can help us solve for developer teams.
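The idea is that each host's Envoy boots with only a pointer to a central management server and pulls everything else over xDS. A rough bootstrap sketch (the management-server address is hypothetical; the port matches Pilot's default xDS port, since Pilot is our planned first iteration):

```yaml
# Listeners and clusters come from the central xDS server instead of
# per-host config files; only this small bootstrap lives on the machine.
dynamic_resources:
  lds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
      - envoy_grpc: { cluster_name: xds_server }
  cds_config:
    api_config_source:
      api_type: GRPC
      grpc_services:
      - envoy_grpc: { cluster_name: xds_server }
static_resources:
  clusters:
  - name: xds_server
    connect_timeout: 1s
    type: STRICT_DNS
    http2_protocol_options: {}   # xDS is served over gRPC
    hosts:
    - socket_address: { address: pilot.internal.example, port_value: 15010 }
```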
Which I think is kind of OP's point, i.e. that maybe reddit is focusing its resources on the wrong issues. AFAIK the website was, and still is, generally responsive from a networking point of view; there are some 503s here and there from time to time, but nothing that would drive me away as a regular user. On the other hand, the redesign (if it manages to override all the present ways of getting past it, such as using old.reddit or i.reddit) will definitely turn me away as a regular user, and I say that as a really long-time user of the website.
The more general issue is that there are a lot of technical people around SV who do, well, technical stuff because they are really, really good at what they're doing (so I'm in no way downplaying this post). The issue is that focusing only on the technical side while ignoring how users actually use your website/product might turn those users away, leaving you with a technical behemoth that has no users (see Google+ for a relatively recent example).
I would be very surprised if the people working on UI/UX and the people working on network visibility and routing were the same group. I suspect you are mentally constructing a zero-sum situation where there is not one. I don't use reddit much, or know that much about it, so I may be wrong, but unless they are a very small shop those are generally pretty separate disciplines.
> I suspect you are mentally constructing a zero-sum situation where there is not one.
Generally speaking, companies do have limited economic resources, and yes, there is "a zero-sum situation" when it comes to allocating them inside a specific company. Management doesn't generally have infinite time and resources at its disposal, and a focus on the technical side of things (or on any other specific side of a particular business) has many times resulted in neglecting the parts of the business not related to that particular topic.
A very good example is the same Google+ case, with the now-famous motto "all arrows pointed in the same direction" or some such, which IMHO made them lose focus on a ton of other important things going on at the same time (Amazon and AWS, most Google products have become a chore to use since then, etc.).
I'm struggling to see your point. A company of the size I assume Reddit is will have dedicated infrastructure engineers and dedicated UI/UX/design people. They are almost certainly separate groups, regardless of the fact that they're paid from the same overall company budget, and will have separate projects. Infrastructure engineers working on infrastructure has no impact on UI engineers working on UI, unless perhaps they have infrastructure engineers work on UI in their downtime. Perhaps that is the case, and that's why everyone is so unhappy with the UI?
Google is a lot larger than Reddit, but it nevertheless blew it by focusing mainly on one thing, Google+. I'm saying that instead of focusing on this tech reorg, which seems kind of overkill (at least from a user's perspective), they could have given more time to the redesign, because at this point the redesign looks like it has received no significant input from management at all; it's a total disaster.
If a company has significant users, ops/site reliability is one area where I'd be hesitant to say they're overfunding, particularly from the outside. At Reddit's growth trajectory, if you're sitting still, you're probably falling behind.
I'm genuinely curious about their growth trajectory lately, meaning the last six months to a year, i.e. whether they continued on the almost exponential upward trajectory that has been evident over the last 4-5 years. Either way, I really do think that botching the redesign is an existential threat for them, and I say that as a reddit user active on the site since before the Digg blowout and the exodus that followed.
Pretty interesting, thanks! Comparing it to the data from 2017 [1], it looks like a ~33% increase in the number of comments (from 900 million to 1.2 billion), while unfortunately the vote numbers given for the two periods aren't comparable: they mentioned 12 billion upvotes for 2017 but 27 billion votes (which presumably includes downvotes) for 2018. All in all not bad; let's see what 2019 brings them.
Definitely, I hate the redesign and use old.reddit.com when I do visit, but what their ops team does on the backend to scale the site seems tangential to that.
We're actually looking to put Envoy in front of the redesign stack at some point in the near future! The major services backing the redesign can be isolated into a few smaller pieces, and we'd like to have Envoy be a routing layer that can abstract this for the central browser client as we evolve the backend.