I work at a large company that uses ATS heavily (top 5 site). There have been huge improvements in performance and functionality since this paper was written.
In his benchmarks he was running with a single volume configured for cache, which would have a global lock on cache. If he partitioned the cache into multiple volumes (something we do by default now) he would have had much lower cache hit response times.
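To illustrate what that partitioning looks like, here's a rough volume.config sketch (not the poster's actual configuration; the number of volumes and the sizes are arbitrary):

    # volume.config: split the disk cache into four volumes so cache
    # reads/writes don't all contend on a single volume
    volume=1 scheme=http size=25%
    volume=2 scheme=http size=25%
    volume=3 scheme=http size=25%
    volume=4 scheme=http size=25%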
The majority of our cache hit response times in production are less than 1 ms.
In the benchmarks I have run, ATS has always been faster than Varnish and NGiNX. If they weren't, I would have made changes to ATS to make it faster.
It's obviously working well for you but I'm curious how you deal with the circular buffer cache with ATS. For a large library that would seem to be an immediate disqualifier. Or maybe I'm not understanding how that works.
I'm assuming as a top 5 site you probably deal with a large library, and appear to be ok with ATS despite that. Comments?
The Tornado Cache (FIFO) hasn't really been an issue as an eviction algorithm. Most of our caches are sized to hold over a week's (most over a month's) worth of objects in cache. Most objects and traffic are temporal in nature. The popular images and videos are normally only popular for a certain time period.
We have looked at not evicting objects on disk if they are in the RAM cache, which has an LRU-like eviction algorithm (really it is CLFUS). Doing this would help avoid evicting really popular objects.
FIFO has advantages over LRU for disks. It is very efficient with writes since they are all sequential. We use rotational disks when building out very large second tier caches.
There are other things to consider when looking at cache in a proxy server. How many bytes does the in-memory index take per object in cache? (For ATS it's 10 bytes, which is extremely efficient.) Also, does the cache use the filesystem and/or sendfile for HTTP (like NGiNX)? sendfile can't be used with HTTPS or HTTP/2, and Netflix is experiencing this pain moving to HTTPS with NGiNX.
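To make the sendfile point concrete, a minimal C sketch (not code from any of these servers; it assumes an already-open file descriptor, a connected socket, and an OpenSSL SSL object):

    #include <openssl/ssl.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Plain HTTP: the kernel moves file bytes straight to the socket (zero-copy). */
    static ssize_t send_plain(int sock_fd, int file_fd, size_t len)
    {
        off_t off = 0;
        return sendfile(sock_fd, file_fd, &off, len);
    }

    /* HTTPS (or HTTP/2 over TLS): bytes must be pulled into userspace so they
     * can be encrypted, so the zero-copy path is lost (absent kernel TLS). */
    static ssize_t send_tls(SSL *ssl, int file_fd)
    {
        char buf[16384];
        ssize_t n, total = 0;
        while ((n = read(file_fd, buf, sizeof buf)) > 0) {
            if (SSL_write(ssl, buf, (int)n) <= 0)
                return -1;
            total += n;
        }
        return total;
    }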
Every proxy server has some advantages: ease of use, well-supported APIs, flexible configuration, dynamically loadable modules, HTTP specification compliance, HTTP/2 support, TLS support, performance, etc. It really depends on what you are looking for when choosing a proxy server.
Well, that's true for any cache. The choice of a simple eviction algorithm in ATS is deliberate, and usually yields better cache efficiency than more complex architectures.
Fwiw, it does support cache pinning, but that's rarely used nor necessary.
ATS uses a tornado cache, where the write pointer just moves to the next object no matter what. So the disk cache doesn't work as an LRU in the same way as other cache servers.
The benefit is that writing is fast and it's constant time since you don't have to do an LRU lookup to pick a place to store the object. The downside is that you are creating cache misses unnecessarily.
It's never really been a problem for me in practice. If you have a lot of heartache over it, I would suggest putting a second cache tier in place. Very unlikely to strike out on both tiers.
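As a toy sketch of that write-pointer behavior (nothing like the real ATS code, just the idea): objects are appended at the current position and the pointer wraps around, so writes are sequential and constant time, and whatever lives at the wrap point gets overwritten regardless of how popular it is.

    #include <string.h>

    #define CACHE_BYTES (1u << 20)        /* pretend 1 MiB cache for illustration */

    static unsigned char cache[CACHE_BYTES];
    static size_t write_ptr;

    /* Append an object at the write pointer; returns its offset.
     * Whatever previously occupied those bytes is implicitly evicted. */
    static size_t cache_write(const void *obj, size_t len)
    {
        if (write_ptr + len > CACHE_BYTES)
            write_ptr = 0;                /* wrap: overwrite the oldest region */
        size_t off = write_ptr;
        memcpy(cache + off, obj, len);
        write_ptr += len;
        return off;
    }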
> The downside is that you are creating cache misses unnecessarily.
Statistically, it balances out just fine. It turns out that just by controlling how objects get into the cache, you can affect cache policy enough that eviction policies don't much matter, or at least, a "random out" isn't much different from an "LRU".
To avoid unnecessary cache writes, there's also a plugin that implements a rudimentary LRU. Basically, an object has to see some amount of traffic before it's allowed to be written to the cache. This is typically done in a scenario where it's OK to hit the parent caches, or origins, once or a few times extra. It can also be a very useful way to avoid too heavy a disk write load on SSD drives (which can be sensitive to excessive write wear, of course). See
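The idea, as a toy sketch (this is not the actual plugin; the counter table and threshold are made up for illustration): count requests per URL and only admit an object to the disk cache once it has been seen enough times.

    #include <stdint.h>

    #define TABLE_SIZE 65536
    static uint8_t hits[TABLE_SIZE];       /* tiny counter table, collisions and all */

    static uint32_t hash_url(const char *url)
    {
        uint32_t h = 2166136261u;          /* FNV-1a */
        for (; *url; url++)
            h = (h ^ (uint8_t)*url) * 16777619u;
        return h;
    }

    /* Returns 1 once a URL has been requested at least `threshold` times. */
    static int admit_to_cache(const char *url, unsigned threshold)
    {
        uint32_t slot = hash_url(url) % TABLE_SIZE;
        if (hits[slot] < 255)
            hits[slot]++;
        return hits[slot] >= threshold;
    }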
I believe the term you're looking for is "cache admission policy." This is an adjunct to cache eviction; both are needed for success. I'm very curious what a highly efficient insertion policy and trivial "eviction" policy (FIFO) would look like in practice.
PS: If anyone is interested in these problems, We're Hiring.
edit: https://aws.amazon.com/careers/ or preferably drop me a line to my profile email or my username "at amazon.com" for a totally informal chat (I'm an IC, not a manager, recruiter, or salesperson)
Yeah, with SSDs I wonder how much that really helps to improve performance vs. just no cache. Most SSDs have a lot of caching implemented internally, so a disk cache can often be self-defeating.
"It Depends." If youre doing "random" writes down to the block dev, like updating a filesystem, it can be very bad. You'll end up hitting the read/update/write cell issues and block other concurrent access. In general I'd worry (expect total throughput to go down, and tail latency way up) around a 10-20% write:read ratio. Conversely if youre doing sane sequential writes, say log structure merges with a 64-256KB chunk size, Id expect much less impact to your read latencies.
Comparison with Varnish
from "Performance Evaluation of the Apache Traffic Server and Varnish Reverse Proxies" (2012) [1]
... the results indicated that Apache Traffic Server reached better cache hit rates and slightly better bandwidth throughput with the cost of higher system and network resource usage. Varnish on the other hand managed to response higher request rates with better response time, especially for the cache hits. The findings in this thesis indicates that Varnish seems to be more promising reverse proxy.
Our CDN at Netlify (https://www.netlify.com) is based on Traffic Server and its powerful plugin engine.
I've used Squid, Varnish, and nginx plenty, but Traffic Server beat them in our benchmarks, and the built-in SSL termination + plugin API make it extremely powerful...
As others have indicated, it's a proxy server rather than a general purpose webserver. There's no code relation between the two servers; it's simply that the team at Yahoo chose to pursue it as an Apache Foundation project when they open sourced it.
ATS scales orders of magnitude better than Apache, due to its process model. Whereas at Yahoo we would budget between 30-200 simultaneous connections per Apache server (prefork), the proxy service which I ran using ATS was budgeted for over 100,000 concurrent connections per machine.
It's significantly less featureful than Apache, but it does caching substantially better than any other cache server commonly available (nginx, apache, squid, varnish).
"Apache, due to its process model."... "(prefork)"
Seems to me that the above is all based on reflections from decades ago w/ Apache httpd 1.3. Right now, all web servers can handle similar levels of concurrency with the bottleneck being the network pipe itself.
ATS is a great platform; using it in combo w/ Apache httpd (2.4) allows a pure open source implementation with all the power, speed, reliability one could want, and protection against Open Core business models.
Uh well, it wasn't "decades ago" but it was some old timey stuff. Yahoo used Apache 1.3 as recently as 2012. They also disabled Keep-Alive on Apache, and most properties would use the hardcoded default number of prefork processes (32). It wasn't the smartest setup.
Nevertheless, I don't think that nginx / Apache / Varnish / haproxy / etc. are able to handle similar concurrent connection levels as ATS without significantly impacting 95th percentile latency due to their core architectures.
Hey, nice seeing you here! Glad to see the project reach HN (finally). That indeed is a very good slide deck.
Some features that I preach are:
- good turnkey default values
- Lua support
- config options galore (Bryan labels it a con, but if you want control it's perfect)
- good logging
- historically proven scalability on large SMP, very large memory, multi-NIC systems
Edit: forgot to add one more thing, though maybe not worthy of a bullet point. If possible, prefer physical hardware over virtual; that's where I've seen ATS performance shine. That is one reason why you would want as much config control as possible.
Yeah; I find many people in the modern web services world end up using Apache HTTPD as a caching proxy for availability reasons. It's probably just because it's old as hell, so everyone knows the config file format by heart. I know I went with Apache HTTPD on a recent project because the features we needed were only in the paid version of Nginx and we just didn't want to deal with licensing (the dollar amount was trivial in corporate dollars, but it would have taken us months to get the purchase through procurement, and the entire rest of the project used open source software). So it's good to have another open source option to keep in mind.
Application health checks mostly; though we were using it in a caching server context (static files and non-whitelisted endpoints served directly from Apache vs. hitting the app servers). They're available in Nginx Plus, just not the free version.
Again, the pricing wasn't the issue; it was the fact we would have had to go through procurement - which involves a few weeks/months of process at any decently large company. So we ended up using Apache because the team was familiar with it and knew it could support our use case. While Nginx probably would have performed better, in the end it was just easier to use Apache because it was a mature Open Source project and throw a few extra VMs at the caching tier to make up for the performance gap.
> Can someone explain how this is different than how most people use Apache HTTPD these days
You mean as a reverse proxy / cache server? I don't have statistics, but I would think that most people use Apache as a regular HTTP server (serving files or as part of a *AMP stack).
Edit: I was serious about that; I was under the impression that most stacks use much lighter weight http daemons than Apache these days. I understand that legacy apps are still out there and not everyone is going to refactor, but anyone developing web applications under Apache in 2016 is just a glutton for punishment...
Change this to "People still use X?" where the value of X is pretty much any technology you've ever heard of, dating back to the 1970's (if not before). And the answer will, to a first approximation, always be "yes".
Now the number of people using X might be small, but you can all but bet your life that somebody, somewhere is, indeed, still using it. And depending on what it is, you might be surprised as how large the number actually is. Keep in mind, HN and Reddit, etc., comprise something of an echo chamber, where people of a certain mindset and orientation flock. The world is MUCH larger.
No, RPG on iSeries /AS400 machines isn't "cool" and you won't see it mentioned on HN much (if at all) but this stuff is still used all over the place. LAMP stacks? Yeah, still widely used. OS/2? Not exactly "widely" used, but still used. COBOL? Yep. Fortran? Yep. MVS? Yep. And so on. Now, granted, this stuff isn't used by "hip startups" or "unicorns", but the world is a lot bigger than the SV startup scene.
Technologies die VERY slowly for whatever reason, at least in regards to the "long tail" (so to speak) of the usage curve.
I know a hosting company that exclusively deploys Apache, because that's what they know how to configure. Everything that works under nginx will still work under Apache. If they need more performance, they put Apache behind a Riverbed Stingray/SteelApp/Brocade Traffic Manager (or whatever it's called these days).
I think people are being too hard on Apache; it's still a great webserver for a ton of applications. That being said, I prefer to configure Nginx over Apache.
Also, if Apache is configured well, using the Event MPM (or even the older Worker) and sane thread counts for the server it's running on, it's a lot faster than it used to be. I can't say how fast compared to Nginx because literally every comparison between the two I've seen has hamstrung Apache by using the ancient Prefork MPM (and I haven't done a rigorous comparison myself), but I expect it's at least on the same order of magnitude.
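For what it's worth, here's a hedged example of what that tuning can look like with the 2.4 event MPM (the numbers are purely illustrative, not a recommendation; scale them to the machine's cores and RAM):

    <IfModule mpm_event_module>
        # Sized so that MaxRequestWorkers <= ServerLimit * ThreadsPerChild
        StartServers             4
        ServerLimit              8
        ThreadsPerChild         64
        MaxRequestWorkers      512
        # 0 = children are never recycled based on connection count
        MaxConnectionsPerChild   0
    </IfModule>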
Not really at small scale, but if you're building a large service there are several advantages.
For one thing, not having to copy response data between processes improves throughput. Since Varnish is so resistant to supporting SSL natively, you'll always have to place something in front of it to use it with the modern web. Whether it's haproxy, Apache or nginx, that's just one more thing to deal with.
I have some other beefs with Varnish, but the most annoying one is the absence of a persistent disk cache. If the Varnish process dies, there goes your disk cache. Even though cache data is written out to disk, Varnish punted on saving an index and re-using an old process's cache, so it writes the cache to an unlinked file.
Imagine a bad code push or new traffic pattern that causes core dumps across your entire service footprint -- and now it isn't just a problem of getting the process back up and stable, you have also lost hundreds of terabytes of cache data. Or something as simple as rolling out a new version. You can architect around the problem, but why should you even have to?
ATS also (recently) supports Lua for plugins, which is way more powerful than VCL. It is a finicky piece of software though, and there are a lot more sharp edges that you're likely to cut yourself on during the initial honeymoon period versus Varnish.
Varnish (just like Nginx) is putting key features behind a paywall. That's one of the reasons I personally want to consider Traffic Server. Plus, it's been around for ages, has a great architecture, and a great track record as well. All it needs is a little more awareness and that's why I keep posting it here. :)
I sympathize with the authors of Varnish and nginx; their software is used all over the place, and they want to make a living at it. I just don't want to support that kind of business model, and I'm never dealing with per-server license compliance again.
I wish more companies would model themselves after Percona: charge for support, custom engineering and on-call -- don't fork or paywall any code.
ATS suffers by comparison, since there is no "ATS Inc." to provide support and engineering work. There's OmniTI, but I don't have first hand experience with their service to say if it's worthwhile or not. They did get paid to write the current ATS docs, so presumably they know what they're doing.
I wish ATS got more attention, but it is after all a bit of a niche product hidden away in the Apache Foundation with a bunch of unrelated Java projects. It's too fiddly for small scale use, and once you hit large scale you're pretty much hiring someone from Yahoo or elsewhere that has experience running and developing it (for example: I'd like to hire ATS people). Doesn't give it a lot of opportunity to trickle into smaller shops and grow with their service.
There are right and wrong ways to monetize. Putting basic features behind a paywall and per-server licensing as you pointed out is not something I can live with. Even if I don't use these features, I feel I'm using a subpar product, and this makes me look for alternatives such as ATS and H2O [0].
I disagree that it is not expensive. If you're launching a single instance then yes, it's tolerable: $0.21/hr or $1839 per year (very similar to their non-AMI pricing). Anything is tolerable at a low scale.
Now think about services that use nginx on every machine as a general purpose URL API interface. It's not uncommon, why bother re-inventing the HTTP server wheel. At a previous company, a service I ran would have cost $3.6 million a year in nginx plus licensing fees. Almost none of the added 'plus' features would have been at all useful.
If you see nginx plus as a way to pay for the core software then sure, maybe that cost is appropriate. I will not support them with per-server licensing of gated off features. Oh and by the way, nginx plus is closed source and is only available on a small handful of platforms, and doesn't always maintain the same release schedule as the open source version, all as a way of supporting their licensing scheme.
I would support nginx via professional services and support fees, but only in conjunction with the open source release. So it's up to them if they want that money or not.
Let me ask you this: most of the "premium" Nginx Plus features are specific to load balancing. You shouldn't need more than 2 or 3 load balancers, right? I.e., a load balancer in each availability zone or in multiple regions? Beyond that, use Nginx open source. This is exactly what I do.
I get what you're saying, but now you have two divergent nginx releases to maintain. nginx plus isn't simply nginx with an extra module, it is closed source and on a separate release train.
For instance, the last nginx plus release (R8) is based on nginx 1.9.9 (plus was released 40 days later). The previous is based on nginx 1.9.4 (plus was released 25 days later). It isn't a matter of life and death, but it is an annoyance and unnecessary.
At least with Varnish Plus, it's still the same open source server with proprietary modules added in. I understand that nginx wants to go that route eventually, but they're hamstrung by the lack of dynamic module loading for the moment.
Apache Traffic Server has support for native C++ plugins which allow you to make transformations to the headers and the body as the request and response passes through it. ATS's power comes mainly from these plugins, I would say, rather than just a reverse caching proxy.
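As a rough idea of what a plugin looks like, here's a minimal sketch using the C API that C/C++ plugins are built on (the header name, its value, and the choice of hook are made up for illustration; a real plugin would also register itself):

    #include <ts/ts.h>

    /* Continuation handler: add an "X-Example" header to every client response. */
    static int
    add_header(TSCont contp, TSEvent event, void *edata)
    {
        TSHttpTxn txnp = (TSHttpTxn)edata;
        TSMBuffer bufp;
        TSMLoc hdr_loc, field_loc;

        if (TSHttpTxnClientRespGet(txnp, &bufp, &hdr_loc) == TS_SUCCESS) {
            if (TSMimeHdrFieldCreateNamed(bufp, hdr_loc, "X-Example", 9, &field_loc) == TS_SUCCESS) {
                TSMimeHdrFieldValueStringInsert(bufp, hdr_loc, field_loc, -1, "hello", 5);
                TSMimeHdrFieldAppend(bufp, hdr_loc, field_loc);
                TSHandleMLocRelease(bufp, hdr_loc, field_loc);
            }
            TSHandleMLocRelease(bufp, TS_NULL_MLOC, hdr_loc);
        }

        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        return 0;
    }

    void
    TSPluginInit(int argc, const char *argv[])
    {
        (void)argc;
        (void)argv;
        /* Run the continuation every time a response is about to be sent. */
        TSHttpHookAdd(TS_HTTP_SEND_RESPONSE_HDR_HOOK, TSContCreate(add_header, NULL));
    }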
Huh. You could conceivably replace ISA/TMG (Microsoft Forefront) with ATS. There are still places using those products as load balancers / reverse proxies even though support is almost completely gone.
Likely because ATS is considered 'difficult' whereas the others are 'easy'. I'd argue that if you are running a serious site, you need engineering expertise regardless of which software you choose.
At a company I was at a while ago we used/abused ATS _very_ heavily - apart from the pain of a comparatively obscure tool, it was great.
It took us some digging and work to get it configured exactly right, especially since we were using it in a fairly nonstandard way - as a caching forward proxy to external data sources.
Some people do use nginx as a caching server. CloudFlare, for example, is built on top of it.
The reason to do so is because you want the rest of what nginx provides, not to get its caching module. It is an extremely barebones solution that only solves the most basic requirements. I can only presume that CloudFlare and others have written their own caching modules for nginx.
A short list of annoyances:
1. No support for multiple disk devices. Files are written to a fixed temp path and then renamed to their real destination. So you need to use RAID to present the disks as one logical device, which is a wholly unnecessary expense in a caching environment.
2. No support for purging in open source. This is an nginx plus feature, which starts at $1900/server/year.
3. Because of the temp file / rename thing, support for streaming subsequent requests off of the first request that is filling cache is janky. Subsequent requests have to acquire a lock.
4. No support for any fancier cache setups, utilizing ICP / HTCP.
It makes sense, considering that nginx is typically ahead of the pack on support for things like SPDY and H2. Plus if you use OpenResty with its Lua functionality, you can do a lot of fancy things in nginx and reduce your dependency on Varnish's VCL. And you have to have something in front of Varnish to do SSL anyway.
VCL in particular is kind of a trap. Early on it can do what you need -- remove a header, set a header, basic branching. Then you want to do basic arithmetic, or validity checking, or anything that isn't suitable for string assignment or regex and you straight up can't do it. VCL makes me long for the power of bash scripts.
Ultimately though, it's really not a great solution to separate these concerns between multiple applications. You're going to get bitten somewhere, even if it's just the old ephemeral port exhaustion problem.
That is true. I was particularly interested in tag-based cache purge and although there are similar open-source Nginx modules [0] and [1], they still don't have that out of the box.
Tag-based cache purges are something I would love to see in ATS. I think doing it correctly would require a complete rejiggering of the cache storage though, and that's not something to be undertaken lightly.
Storing externally in redis (for ledge) seems like the wrong approach to me. Better to store metadata externally and generate the purge URLs based on that. It's not ideal, but it's the best option I've come up with.
Initially created by Inktomi, acquired by Yahoo!, and then open sourced and brought to the Apache Foundation in 2009 because of Yahoo!'s good experience with Hadoop in 2008/09.