I work at a large company that uses ATS heavily (top 5 site). There have been huge improvements in performance and functionality since this paper was written.
In his benchmarks he was running with a single volume configured for cache, which would have a global lock on cache. If he partitioned the cache into multiple volumes (something we do by default now) he would have had much lower cache hit response times.
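To illustrate what that partitioning looks like, here's a rough volume.config sketch (not the poster's actual configuration; the number of volumes and the sizes are arbitrary):

    # volume.config: split the disk cache into four volumes so cache
    # reads/writes don't all contend on a single volume
    volume=1 scheme=http size=25%
    volume=2 scheme=http size=25%
    volume=3 scheme=http size=25%
    volume=4 scheme=http size=25%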
The majority of our cache hit response times in production are less than 1 ms.
In the benchmarks I have run, ATS has always been faster than Varnish and NGiNX. If they weren't, I would have made changes to ATS to make it faster.
It's obviously working well for you but I'm curious how you deal with the circular buffer cache with ATS. For a large library that would seem to be an immediate disqualifier. Or maybe I'm not understanding how that works.
I'm assuming as a top 5 site you probably deal with a large library, and appear to be ok with ATS despite that. Comments?
The Tornado Cache (FIFO) hasn't really been an issue as an eviction algorithm. Most of our caches are sized to hold over a week's (most over a month's) worth of objects in cache. Most objects and traffic are temporal in nature. The popular images and videos are normally only popular for a certain time period.
We have looked at not evicting objects on disk if they are in the RAM cache, which has an LRU-like eviction algorithm (really it is CLFUS). Doing this would help avoid evicting really popular objects.
FIFO has advantages over LRU for disks. It is very efficient with writes since they are all sequential. We use rotational disks when building out very large second tier caches.
There are other things to consider when looking at cache in a proxy server. How many bytes does the in-memory index take per object in cache? (For ATS it's 10 bytes, which is extremely efficient.) Also, does the cache use the filesystem and/or sendfile for HTTP (like NGiNX)? sendfile can't be used with HTTPS or HTTP/2, and Netflix is experiencing this pain moving to HTTPS with NGiNX.
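To make the sendfile point concrete, a minimal C sketch (not code from any of these servers; it assumes an already-open file descriptor, a connected socket, and an OpenSSL SSL object):

    #include <openssl/ssl.h>
    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Plain HTTP: the kernel moves file bytes straight to the socket (zero-copy). */
    static ssize_t send_plain(int sock_fd, int file_fd, size_t len)
    {
        off_t off = 0;
        return sendfile(sock_fd, file_fd, &off, len);
    }

    /* HTTPS (or HTTP/2 over TLS): bytes must be pulled into userspace so they
     * can be encrypted, so the zero-copy path is lost (absent kernel TLS). */
    static ssize_t send_tls(SSL *ssl, int file_fd)
    {
        char buf[16384];
        ssize_t n, total = 0;
        while ((n = read(file_fd, buf, sizeof buf)) > 0) {
            if (SSL_write(ssl, buf, (int)n) <= 0)
                return -1;
            total += n;
        }
        return total;
    }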
Every proxy server has some advantages: ease of use, well-supported APIs, flexible configuration, dynamically loadable modules, HTTP specification compliance, HTTP/2 support, TLS support, performance, etc. It really depends on what you are looking for when choosing a proxy server.
Well, that's true for any cache. The choice of a simple eviction algorithm in ATS is deliberate, and usually yields better cache efficiency than more complex architectures.
Fwiw, it does support cache pinning, but that's rarely used nor necessary.
ATS uses a tornado cache, where the write pointer just moves to the next object no matter what. So the disk cache doesn't work as an LRU in the same way as other cache servers.
The benefit is that writing is fast and it's constant time since you don't have to do an LRU lookup to pick a place to store the object. The downside is that you are creating cache misses unnecessarily.
It's never really been a problem for me in practice. If you have a lot of heartache over it, I would suggest putting a second cache tier in place. Very unlikely to strike out on both tiers.
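As a toy sketch of that write-pointer behavior (nothing like the real ATS code, just the idea): objects are appended at the current position and the pointer wraps around, so writes are sequential and constant time, and whatever lives at the wrap point gets overwritten regardless of how popular it is.

    #include <string.h>

    #define CACHE_BYTES (1u << 20)        /* pretend 1 MiB cache for illustration */

    static unsigned char cache[CACHE_BYTES];
    static size_t write_ptr;

    /* Append an object at the write pointer; returns its offset.
     * Whatever previously occupied those bytes is implicitly evicted. */
    static size_t cache_write(const void *obj, size_t len)
    {
        if (write_ptr + len > CACHE_BYTES)
            write_ptr = 0;                /* wrap: overwrite the oldest region */
        size_t off = write_ptr;
        memcpy(cache + off, obj, len);
        write_ptr += len;
        return off;
    }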
> The downside is that you are creating cache misses unnecessarily.
Statistically, it balances out just fine. It turns out that just by controlling how objects get into the cache, you can affect cache policy enough that eviction policies don't much matter, or at least, a "random out" isn't much different from an "LRU".
To avoid unnecessary cache writes, there's also a plugin that implements a rudimentary LRU. Basically, an object has to see some amount of traffic before it's allowed to be written to the cache. This is typically done in a scenario where it's OK to hit the parent caches, or origins, once or a few times extra. It can also be a very useful way to avoid too heavy a disk write load on SSD drives (which can be sensitive to excessive write wear, of course). See
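The idea, as a toy sketch (this is not the actual plugin; the counter table and threshold are made up for illustration): count requests per URL and only admit an object to the disk cache once it has been seen enough times.

    #include <stdint.h>

    #define TABLE_SIZE 65536
    static uint8_t hits[TABLE_SIZE];       /* tiny counter table, collisions and all */

    static uint32_t hash_url(const char *url)
    {
        uint32_t h = 2166136261u;          /* FNV-1a */
        for (; *url; url++)
            h = (h ^ (uint8_t)*url) * 16777619u;
        return h;
    }

    /* Returns 1 once a URL has been requested at least `threshold` times. */
    static int admit_to_cache(const char *url, unsigned threshold)
    {
        uint32_t slot = hash_url(url) % TABLE_SIZE;
        if (hits[slot] < 255)
            hits[slot]++;
        return hits[slot] >= threshold;
    }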
I believe the term you're looking for is "cache admission policy." This is an adjunct to cache eviction; both are needed for success. I'm very curious what a highly efficient insertion policy and trivial "eviction" policy (FIFO) would look like in practice.
PS: If anyone is interested in these problems, We're Hiring.
edit: https://aws.amazon.com/careers/ or preferably drop me a line to my profile email or my username "at amazon.com" for a totally informal chat (I'm an IC, not a manager, recruiter, or salesperson)
Yeah, with SSDs I wonder how much that really helps to improve performance vs. just no cache. Most SSDs have a lot of caching implemented internally, so a disk cache can often be self-defeating.
"It Depends." If youre doing "random" writes down to the block dev, like updating a filesystem, it can be very bad. You'll end up hitting the read/update/write cell issues and block other concurrent access. In general I'd worry (expect total throughput to go down, and tail latency way up) around a 10-20% write:read ratio. Conversely if youre doing sane sequential writes, say log structure merges with a 64-256KB chunk size, Id expect much less impact to your read latencies.
Comparison with Varnish
from "Performance Evaluation of the Apache Traffic Server and Varnish Reverse Proxies" (2012) [1]
... the results indicated that Apache Traffic Server reached better cache hit rates and slightly better bandwidth throughput with the cost of higher system and network resource usage. Varnish on the other hand managed to response higher request rates with better response time, especially for the cache hits. The findings in this thesis indicates that Varnish seems to be more promising reverse proxy.
Our CDN at Netlify (https://www.netlify.com) is based on Traffic Server and its powerful plugin engine.
I've used Squid, Varnish, and nginx plenty, but Traffic Server beat them in our benchmarks, and the built-in SSL termination + plugin API make it extremely powerful...
As others have indicated, it's a proxy server rather than a general purpose webserver. There's no code relation between the two servers; it's simply that the team at Yahoo chose to pursue it as an Apache Foundation project when they open sourced it.
ATS scales orders of magnitude better than Apache, due to its process model. Whereas at Yahoo we would budget between 30-200 simultaneous connections per Apache server (prefork), the proxy service which I ran using ATS was budgeted for over 100,000 concurrent connections per machine.
It's significantly less featureful than Apache, but it does caching substantially better than any other cache server commonly available (nginx, apache, squid, varnish).
"Apache, due to its process model."... "(prefork)"
Seems to me that the above is all based on reflections from decades ago w/ Apache httpd 1.3. Right now, all web servers can handle similar levels of concurrency with the bottleneck being the network pipe itself.
ATS is a great platform; using it in combo w/ Apache httpd (2.4) allows a pure open source implementation with all the power, speed, reliability one could want, and protection against Open Core business models.
Uh well, it wasn't "decades ago" but it was some old timey stuff. Yahoo used Apache 1.3 as recently as 2012. They also disabled Keep-Alive on Apache, and most properties would use the hardcoded default number of prefork processes (32). It wasn't the smartest setup.
Nevertheless, I don't think that nginx / Apache / Varnish / haproxy / etc. are able to handle similar concurrent connection levels as ATS without significantly impacting 95th percentile latency due to their core architectures.
Hey, nice seeing you here! Glad to see the project reach HN (finally). That indeed is a very good slide deck.
Some features that I preach are:
- good turnkey default values
- Lua support
- config options galore (Bryan labels it a con, but if you want control it's perfect)
- good logging
- historically proven scalability on large SMP, very large memory, multi-NIC systems
Edit: forgot to add one more thing, though maybe not worthy of a bullet point. If possible, prefer physical hardware over virtual; that's where I've seen ATS performance shine. That is one reason why you would want as much config control as possible.
Yeah; I find many people in the modern web services world end up using Apache HTTPD as a caching proxy for availability reasons. It's probably just because it's old as hell, so everyone knows the config file format by heart. I know I went with Apache HTTPD on a recent project because the features we needed were only in the paid version of Nginx and we just didn't want to deal with licensing (the dollar amount was trivial in corporate dollars, but it would have taken us months to get the purchase through procurement, and the entire rest of the project used open source software). So it's good to have another open source option to keep in mind.
Application health checks mostly; though we were using it in a caching server context (static files and non-whitelisted endpoints served directly from Apache vs. hitting the app servers). They're available in Nginx Plus, just not the free version.
Again, the pricing wasn't the issue; it was the fact we would have had to go through procurement - which involves a few weeks/months of process at any decently large company. So we ended up using Apache because the team was familiar with it and knew it could support our use case. While Nginx probably would have performed better, in the end it was just easier to use Apache because it was a mature Open Source project and throw a few extra VMs at the caching tier to make up for the performance gap.
> Can someone explain how this is different than how most people use Apache HTTPD these days
You mean as a reverse proxy / cache server? I don't have statistics, but I would think that most people use Apache as a regular HTTP server (serving files or as part of a *AMP stack).
Edit: I was serious about that; I was under the impression that most stacks use much lighter weight http daemons than Apache these days. I understand that legacy apps are still out there and not everyone is going to refactor, but anyone developing web applications under Apache in 2016 is just a glutton for punishment...
Change this to "People still use X?" where the value of X is pretty much any technology you've ever heard of, dating back to the 1970's (if not before). And the answer will, to a first approximation, always be "yes".
Now the number of people using X might be small, but you can all but bet your life that somebody, somewhere is, indeed, still using it. And depending on what it is, you might be surprised as how large the number actually is. Keep in mind, HN and Reddit, etc., comprise something of an echo chamber, where people of a certain mindset and orientation flock. The world is MUCH larger.
No, RPG on iSeries /AS400 machines isn't "cool" and you won't see it mentioned on HN much (if at all) but this stuff is still used all over the place. LAMP stacks? Yeah, still widely used. OS/2? Not exactly "widely" used, but still used. COBOL? Yep. Fortran? Yep. MVS? Yep. And so on. Now, granted, this stuff isn't used by "hip startups" or "unicorns", but the world is a lot bigger than the SV startup scene.
Technologies die VERY slowly for whatever reason, at least in regards to the "long tail" (so to speak) of the usage curve.
I know a hosting company that exclusively deploys Apache, because that's what they know how to configure. Everything that works under nginx will still work under Apache. If they need more performance, they put Apache behind a Riverbed Stingray/SteelApp/Brocade Traffic Manager (or whatever it's called these days).
I think people are being too hard on Apache; it's still a great webserver for a ton of applications. That being said, I prefer to configure Nginx over Apache.
Also, if Apache is configured well, using the Event MPM (or even the older Worker) and sane thread counts for the server it's running on, it's a lot faster than it used to be. I can't say how fast compared to Nginx because literally every comparison between the two I've seen has hamstrung Apache by using the ancient Prefork MPM (and I haven't done a rigorous comparison myself), but I expect it's at least on the same order of magnitude.
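For what it's worth, here's a hedged example of what that tuning can look like with the 2.4 event MPM (the numbers are purely illustrative, not a recommendation; scale them to the machine's cores and RAM):

    <IfModule mpm_event_module>
        # Sized so that MaxRequestWorkers <= ServerLimit * ThreadsPerChild
        StartServers             4
        ServerLimit              8
        ThreadsPerChild         64
        MaxRequestWorkers      512
        # 0 = children are never recycled based on connection count
        MaxConnectionsPerChild   0
    </IfModule>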
Not really at small scale, but if you're building a large service there are several advantages.
For one thing, not having to copy response data between processes improves throughput. Since Varnish is so resistant to supporting SSL natively, you'll always have to place something in front of it to use it with the modern web. Whether it's haproxy, Apache or nginx, that's just one more thing to deal with.
I have some other beefs with Varnish, but the most annoying one is the absence of a persistent disk cache. If the Varnish process dies, there goes your disk cache. Even though cache data is written out to disk, Varnish punted on saving an index and re-using an old process's cache, so it writes the cache to an unlinked file.
Imagine a bad code push or new traffic pattern that causes core dumps across your entire service footprint -- and now it isn't just a problem of getting the process back up and stable, you have also lost hundreds of terabytes of cache data. Or something as simple as rolling out a new version. You can architect around the problem, but why should you even have to?
ATS also (recently) supports Lua for plugins, which is way more powerful than VCL. It is a finicky piece of software though, and there are a lot more sharp edges that you're likely to cut yourself on during the initial honeymoon period versus Varnish.
Varnish (just like Nginx) is putting key features behind a paywall. That's one of the reasons I personally want to consider Traffic Server. Plus, it's been around for ages, has a great architecture, and a great track record as well. All it needs is a little more awareness and that's why I keep posting it here. :)
I sympathize with the authors of Varnish and nginx; their software is used all over the place, and they want to make a living at it. I just don't want to support that kind of business model, and I'm never dealing with per-server license compliance again.
I wish more companies would model themselves after Percona: charge for support, custom engineering and on-call -- don't fork or paywall any code.
ATS suffers by comparison, since there is no "ATS Inc." to provide support and engineering work. There's OmniTI, but I don't have first hand experience with their service to say if it's worthwhile or not. They did get paid to write the current ATS docs, so presumably they know what they're doing.
I wish ATS got more attention, but it is after all a bit of a niche product hidden away in the Apache Foundation with a bunch of unrelated Java projects. It's too fiddly for small scale use, and once you hit large scale you're pretty much hiring someone from Yahoo or elsewhere that has experience running and developing it (for example: I'd like to hire ATS people). Doesn't give it a lot of opportunity to trickle into smaller shops and grow with their service.
There are right and wrong ways to monetize. Putting basic features behind a paywall and per-server licensing as you pointed out is not something I can live with. Even if I don't use these features, I feel I'm using a subpar product, and this makes me look for alternatives such as ATS and H2O [0].
I disagree that it is not expensive. If you're launching a single instance then yes, it's tolerable: $0.21/hr or $1839 per year (very similar to their non-AMI pricing). Anything is tolerable at a low scale.
Now think about services that use nginx on every machine as a general purpose URL API interface. It's not uncommon, why bother re-inventing the HTTP server wheel. At a previous company, a service I ran would have cost $3.6 million a year in nginx plus licensing fees. Almost none of the added 'plus' features would have been at all useful.
If you see nginx plus as a way to pay for the core software then sure, maybe that cost is appropriate. I will not support them with per-server licensing of gated off features. Oh and by the way, nginx plus is closed source and is only available on a small handful of platforms, and doesn't always maintain the same release schedule as the open source version, all as a way of supporting their licensing scheme.
I would support nginx via professional services and support fees, but only in conjunction with the open source release. So it's up to them if they want that money or not.
Let me ask you this: most of the "premium" Nginx Plus features are specific to load balancing. You shouldn't need more than 2 or 3 load balancers, right? I.e., a load balancer in each availability zone or in multiple regions? Beyond that, use Nginx open source. This is exactly what I do.
I get what you're saying, but now you have two divergent nginx releases to maintain. nginx plus isn't simply nginx with an extra module, it is closed source and on a separate release train.
For instance, the last nginx plus release (R8) is based on nginx 1.9.9 (plus was released 40 days later). The previous is based on nginx 1.9.4 (plus was released 25 days later). It isn't a matter of life and death, but it is an annoyance and unnecessary.
At least with Varnish Plus, it's still the same open source server with proprietary modules added in. I understand that nginx wants to go that route eventually, but they're hamstrung by the lack of dynamic module loading for the moment.
Apache Traffic Server has support for native C++ plugins which allow you to make transformations to the headers and the body as the request and response passes through it. ATS's power comes mainly from these plugins, I would say, rather than just a reverse caching proxy.
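As a rough idea of what a plugin looks like, here's a minimal sketch using the C API that C/C++ plugins are built on (the header name, its value, and the choice of hook are made up for illustration; a real plugin would also register itself):

    #include <ts/ts.h>

    /* Continuation handler: add an "X-Example" header to every client response. */
    static int
    add_header(TSCont contp, TSEvent event, void *edata)
    {
        TSHttpTxn txnp = (TSHttpTxn)edata;
        TSMBuffer bufp;
        TSMLoc hdr_loc, field_loc;

        if (TSHttpTxnClientRespGet(txnp, &bufp, &hdr_loc) == TS_SUCCESS) {
            if (TSMimeHdrFieldCreateNamed(bufp, hdr_loc, "X-Example", 9, &field_loc) == TS_SUCCESS) {
                TSMimeHdrFieldValueStringInsert(bufp, hdr_loc, field_loc, -1, "hello", 5);
                TSMimeHdrFieldAppend(bufp, hdr_loc, field_loc);
                TSHandleMLocRelease(bufp, hdr_loc, field_loc);
            }
            TSHandleMLocRelease(bufp, TS_NULL_MLOC, hdr_loc);
        }

        TSHttpTxnReenable(txnp, TS_EVENT_HTTP_CONTINUE);
        return 0;
    }

    void
    TSPluginInit(int argc, const char *argv[])
    {
        (void)argc;
        (void)argv;
        /* Run the continuation every time a response is about to be sent. */
        TSHttpHookAdd(TS_HTTP_SEND_RESPONSE_HDR_HOOK, TSContCreate(add_header, NULL));
    }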
Huh. You could conceivably replace ISA/TMG (Microsoft Forefront) with ATS. There are still places using those products as load balancers / reverse proxies even though support is almost completely gone.
Likely because ATS is considered 'difficult' whereas the others are 'easy'. I'd argue that if you are running a serious site, you need engineering expertise regardless of which software you choose.
At a company I was at a while ago we used/abused ATS _very_ heavily - apart from the pain of a comparatively obscure tool, it was great.
It took us some digging and work to get it configured exactly right, especially since we were using it in a fairly nonstandard way - as a caching forward proxy to external data sources.
Some people do use nginx as a caching server. CloudFlare, for example, is built on top of it.
The reason to do so is because you want the rest of what nginx provides, not to get its caching module. It is an extremely barebones solution that only solves the most basic requirements. I can only presume that CloudFlare and others have written their own caching modules for nginx.
A short list of annoyances:
1. No support for multiple disk devices. Files are written to a fixed temp path and then renamed to their real destination. So you need to use RAID to present the disks as one logical device, which is a wholly unnecessary expense in a caching environment.
2. No support for purging in open source. This is an nginx plus feature, which starts at $1900/server/year.
3. Because of the temp file / rename thing, support for streaming subsequent requests off of the first request that is filling cache is janky. Subsequent requests have to acquire a lock.
4. No support for any fancier cache setups, utilizing ICP / HTCP.
It makes sense, considering that nginx is typically ahead of the pack on support for things like SPDY and H2. Plus if you use OpenResty with its Lua functionality, you can do a lot of fancy things in nginx and reduce your dependency on Varnish's VCL. And you have to have something in front of Varnish to do SSL anyway.
VCL in particular is kind of a trap. Early on it can do what you need -- remove a header, set a header, basic branching. Then you want to do basic arithmetic, or validity checking, or anything that isn't suitable for string assignment or regex and you straight up can't do it. VCL makes me long for the power of bash scripts.
Ultimately though, it's really not a great solution to separate these concerns between multiple applications. You're going to get bitten somewhere, even if it's just the old ephemeral port exhaustion problem.
That is true. I was particularly interested in tag-based cache purge and although there are similar open-source Nginx modules [0] and [1], they still don't have that out of the box.
Tag-based cache purges are something I would love to see in ATS. I think doing it correctly would require a complete rejiggering of the cache storage though, and that's not something to be undertaken lightly.
Storing externally in redis (for ledge) seems like the wrong approach to me. Better to store metadata externally and generate the purge URLs based on that. It's not ideal, but it's the best option I've come up with.
Initially created by Inktomi, acquired by Yahoo!, and then open sourced and brought to the Apache Foundation in 2009 because of Yahoo!'s good experience with Hadoop in 2008/09.