What powers Etsy (etsy.com)
166 points by johnzimmerman on Aug 31, 2012 | 44 comments



It's very cool to see these kinds of posts. I always learn something from them and it's always neat to see what others have built to deal with a specific problem.

I wonder, in this case, whether there's any specific reason Etsy is not utilizing virtualization of one kind or another? They're basically building dumb boxes, which seems like a good fit for running a number of slightly beefier VM hosts and just rebuilding bad instances from images (this would be especially convenient since they're already using Chef).

I know that as an e-commerce firm, moving toward this kind of structure has helped us substantially.


FWIW, we do use virtualization for our dev infrastructure (everyone gets a VM), which uses KVM/QEMU, and for our CI cluster, which is LXC based.
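
(Not Etsy's actual tooling, just a sketch of what spinning up an LXC container for a CI slave might look like with the lxc userspace tools; the container and template names are made up:)

    # create a container from the stock ubuntu template and start it detached
    sudo lxc-create -n ci-slave-01 -t ubuntu
    sudo lxc-start -n ci-slave-01 -d
    # list containers to confirm it's running
    sudo lxc-ls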

A bit about the dev infrastructure here: http://codeascraft.etsy.com/2012/03/13/making-it-virtually-e...


Thanks for the reply. Also interesting to see the dev side... we run something similar, although a bit more free-form for the internal developers, but that's being pushed into a more controlled environment ;).

I should also say it's a bit heartening to know of other e-com groups out there benefiting from an in-house stack rather than attempting to go 'TO THE CLOUD!'.


Several years ago, 37signals wrote a post showing that moving away from virtualization improved response times.

http://37signals.com/svn/posts/1819-basecamp-now-with-more-v...


Well, one thing to note there is that the post is from several years ago... both virtualization technologies and processors have gotten better since then.

I think there are certainly specific web and db loads that aren't going to work as well under VM scenarios, but on the whole our experience has been positive. There will, I imagine, always be a slight 'slowdown' due to virtualization, but with the gigantic push toward 'the cloud' right now, it seems to be one of the last things on most people's minds (and obviously Xen is widely used in that field, which is also what we use).

The bigger question, I think, is whether you are running highly CPU-bound tasks, which VMs aren't always very good at. According to Etsy they are very CPU bound, so it makes some amount of sense.


I think, generally speaking, once you start having >10 or >50 servers, VMs just stop making sense. You're getting a lot of hardware because you have a problem that requires it, and for most cloud systems with built-in redundancy, you'd only be running a single VM per machine anyways. Why not just run the machine? You should be using puppet or chef or something so it's not much/any harder to refresh a raw machine than it would be to refresh a VM image.
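
(As a sketch of what 'refreshing' a raw machine can look like once it has PXE-installed a base OS; the hostname and role name here are made up:)

    # converge a freshly installed box onto its role with Chef
    knife bootstrap web042.example.com -x deploy --sudo -r 'role[webserver]'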


Well, homogeneous management of applications/appliances on top of heterogeneous hardware is one reason. Another is testing and prototyping (on the same stack as you do production).

I think most shops with a high hardware need would probably have a mix -- things that scale well to the hardware (eg: database servers) might be set up as clusters of physical machines.

Less (performance) critical stuff like mail servers, bug tracking and time sheet appliances, backup domain controllers etc -- might be run as vms.

I agree that if you find yourself running a single vm on a physical box, you should probably reconsider how you are doing things...

Then again, all things considered -- if your (presumably) easy-to-manage VM-based setup gives you enough performance -- why not keep it?

If nothing else, as you upgrade hardware, if your workload doesn't increase exponentially, you will be able to consolidate onto fewer servers. Consider a database service that in 2003 was set up master-slave on two physical pizza boxes. That could probably be run from two VMs (on physically different boxes, for redundancy) -- with capacity to spare on either box today.


I think this is where we get back to CPU- vs memory-bound tasks. If the task only needs 1/8th of a newer Intel/AMD CPU at any given time (also accounting for bursting), why not see if you can squish some instances together? For CPU-bound work I fully agree with you (although I wonder how hyper-scale companies like Zynga did an analysis at this level and decided upon their stack... if there's a writeup out there about their decisions I'd love to see it... sorry, slightly off-topic).


Most 'cloud' software like Hadoop, Cassandra, clustered Mongo or whatever makes it a point to solve that problem in userspace software, giving you a view of 'one big app' across a bunch of machines. Since that's your level of interaction with the system, VMs tend to not really make a lot of sense; they just give you new opportunities to make mistakes while imposing a small performance tax.

Now, if you have 10 legacy apps on old P4 Xeons and want to lump them all onto a Sandy Bridge server and save a bunch of power and rack space, then separate VMs on one machine make sense.

Basically, a cluster of 30 servers, all running exactly one VM doing the same thing is on-its-face silly. Why even have the VM then?


The advantage I see with virtualization is that you could run your Hadoop workloads on those 30 servers over the weekend and then repurpose the machines for other workloads during the week.


That sounds really good in theory, but in practice it's less good. There are a number of different types of costs to consider:

1. The cost of server hardware
2. The cost of unused hardware capacity
3. The administration cost (people, skills, etc)
4. Opportunity cost

The Cloud(tm) excels at addressing some of these. You don't have to pay /as/ highly for staff to manage your machines and network (the provider does much of that for you). You can run smaller, cheaper instances much closer to their limits.

The downside is that you pay a premium to the provider (even if you are your own provider). Additionally you incur a great deal of opportunity cost. If you use the "excess" capacity on the weekend for Hadoop jobs, and need to stop them because you have a large burst of traffic, you've hurt yourself. Your non-production Hadoop jobs can also have unexpected and unintended impact on your production web servers, causing your users pain.

Given these things, if your focus is on keeping the best experience for users, then you should split things apart.

The actual gain you get from cramming as much as you can onto one piece of hardware is much, much lower than you might expect. It ends up being easier just to get more hardware and dedicate that hardware for specific tasks.

- Avleen Vig, Staff Operations Engineer, Etsy


How long does it take in practice to swap out VM images in this way for different types of loads? How do you monitor that all 30 are ready to go and working right to spec?


A hypervisor is just a kernel with a more clumsy API dictated by legacy hardware design. If you have eight services that can fit on the box together, eight processes on one good kernel should offer better visibility and control over what's going on than would eight VMs each hosting one process (along with some uninteresting crud).

VMs shine when you're stuck with proprietary software that can't all be ported to run on the same kernel; then what you're paying for is a compatibility shim that's cheaper than another box. I don't see much sense in them for software you wrote in-house.


Your point about rebuilding bad instances is just as easy to do with PXE and Chef/Puppet on bare metal.


"Just as easy" might be an overstatement. Equally doable might be closer to the truth.

One example where virtualization is easier in practice is where you're not in control of the network infrastructure -- i.e. you subcontract management of switches etc. to someone else.

For someone the size of Etsy that wouldn't matter much -- they have an obvious need to manage the entire stack. But for smaller institutions that might not target "scaling up to serve the entire public web" -- it is a factor.

Another example is if you have heterogeneous hardware -- it is much easier to just install a hypervisor that runs on every box and manage that via your VM toolset than to set up many different images. It also makes it easier if you need to run different OSes -- do all your OS images have drivers for all your hardware?

Granted, in theory OSes should abstract away hardware -- in practice, if you need to change out a few of your MySQL instances running GNU/Linux for Microsoft SQL Server for a new project or a change in workload, being able to just spin down the old VMs and spin up the new ones helps a lot in terms of agility.

But yes, it's absolutely possible to provision physical hardware as well.


    sudo koan --replace-self
    sudo reboot -f
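
(For context, koan is Cobbler's client-side tool, and --replace-self re-provisions the running box from its PXE/Cobbler profile. A sketch of what a full bare-metal rebuild might look like with Cobbler plus Chef; the server name and profile below are made up:)

    # re-image this host from its Cobbler profile, then reboot into the installer
    sudo koan --replace-self --server=cobbler.example.com --profile=webserver
    sudo reboot -f
    # after the reinstall, converge configuration management
    sudo chef-client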


The OP says they can spin up a new bare metal machine in 10 minutes. It's likely the performance hit from running in a virtualized environment isn't worth shaving time off the 10 minute build.


I'd say there are other reasons behind using virtualization than simply quick system spin-up, such as easier/better integration with (Dev|Sys)Op tools, higher hardware->instance densities than you'll ever see from physical... etc.

I think the performance hit for many people is overstated. Unless you're at the long-tail of CPU bound instances I imagine it won't become very apparent. That said, if I'm missing something/incorrect then someone please correct me!


So, this is somewhat untrue for us.

For example, let's look at some common web stack infrastructure:

* Web servers
* Database servers
* Monitoring servers
* Asynchronous task processors
* Scheduled task servers (eg, cron jobs)

In all of these cases, we want the service on that hardware to be able to use 100% CPU. For example, it doesn't make sense to get N web servers, and then run M virtual instances with apache on them, when each instance will max out its available resources. Just run N dedicated web servers :-)

The same is true for pretty much everything else too, sometimes for different reasons. Eg, you don't want your database servers or monitoring system to be impacted by multi-tenancy issues.

- Avleen Vig, Staff Operations Engineer, Etsy


I don't work at Etsy or have any specific knowledge, but it sounds like they already have the tool integration they need with their current setup.

Increasing densities through virtualization allows you to perform more discrete functions, but obviously reduces the aggregate capacity to do so. Etsy specifically has a goal of minimizing instance types, and virtualization would push them in a direction counter to their goals.

In my experience, the biggest weakness of virtualization is subpar I/O, and I would imagine that this would be an issue for Etsy as well.

Virtualization has a very real cost. If the performance hit (CPU & I/O) is (say) 20%, that's 20% of your capacity you pay to keep operational but don't get to use. The benefits don't outweigh that cost for every situation.


It sounds like the reason they have so many machines is because they need a lot of CPU. Compressing those machines onto fewer CPUs would not be a win.


The Supermicro servers are really, really great - one of the best investments I've made so far in 2012.

I bought mine (a Supermicro 2U Dual 3GHz Xeon 4GB 12 Bay Storage Server) for $400 on eBay, with great service and terms. Recommended if you are ever in need of (or just want) a private rack server (1U, 2U or more).


I'm a big fan of Supermicro systems. They are probably some of the best bang-for-the-buck options out there, although one has to purchase them with the understanding of the potential 'risk' of not having a giant support department behind them like Dell/HP. That's not a knock against Supermicro, just a note for the PHB type view.


https://plus.google.com/u/1/108703267897818623506/posts/GK2A...

One of the rows in our data center, lots of Supermicro goodness. We use the 12 bay 'jupiter' chassis for 2x SSD and 10X 2TB spinning rust per server.


I've had a few Supermicro systems: some rack-mounted servers and two 4U workstation pedestals.

I have one bad thing to say...they are loud, really _loud_.

Several years ago I had a 1U SM server in a rack full of HP servers; I could hear the Supermicro over all of them.

The 4U pedestal I have at home as a file server is the "super-quiet" model with 8 3.5" bays in front, and it is still insanely loud. I had to rig up a fan controller just to keep it livable in the apartment, as the motherboard couldn't slow the fans down enough.


Here's a new Supermicro form factor that I really like that wasn't mentioned in the post:

1U with 2 CPUs, 10 x 2.5" drive bays, redundant power

http://www.supermicro.com/products/system/1U/1027/SYS-1027R-...


I noticed something like this several times in the article: "Each machine has 2x 1gbit ethernet ports, in this case we’re only using one."

It sounds like a single switch failure could take down a large amount of their infrastructure. The seeming lack of geographic (or even datacenter) redundancy also seems a little dangerous at this scale.


There is detail about multi-site redundancy which we're saving for a future post :-)

You're right that a single switch failure would have an impact, but the impact of one switch going down would actually be very minimal. For example, our web servers are spread out between many switches. The switches and network equipment have greater redundancy between them.

- Avleen Vig, Staff Operations Engineer, Etsy


LAGs or redundant NICs are a mug's game. How deep do you go? Dual ToRs uplinked to dual aggs to dual transit routers, each with N+1 transit? All fed by A+B power infra, of course.

Or keep it simple. One NIC for prod traffic, one for the private/control plane. Separate ToRs are nice, or just don't oversubscribe them. Keep critical data hot across racks. Plan for failure of any given host, then for DC failure. You'll have a site power or transit failure waaay more often than a bad ToR.


Yeah, that impresses me as a bit odd too.

I've seen situations where folks have done this because they believed they would need VSS/VPC or similar on the switch side, but ALB on the server should avoid that.

Perhaps they have an array of access switches, each serving some subset, to lessen the impact?


VSS/VPC == MLAG... but what is ALB? I am guessing the server uses one link until failure, then uses the other? Even then, there are still network issues to overcome that may justify MLAG.


Honestly, we find it's much easier to spread the load between many switches, and have enough capacity that one switch failure is a non-event. Switches are expensive, it's true. After a certain point it does become less expensive and complex to just add more servers and switches. There's a strong advantage to keeping things as simple as possible. Bonding isn't really complicated, but how many not-complicated things can you add before things become complicated? :-)
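
(For reference, the bonding being discussed really is only a few lines of config on a Debian-style system; a sketch with made-up interface names and addresses, using active-backup as one possible mode:)

    # /etc/network/interfaces (requires the ifenslave package)
    auto bond0
    iface bond0 inet static
        address 10.0.0.10
        netmask 255.255.255.0
        bond-slaves eth0 eth1
        bond-mode active-backup
        bond-miimon 100
        bond-primary eth0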


In fact, while testing mrsync to synchronize search indexes among prod servers, they overloaded a switch's CPU and took down the site for a few minutes.


We use those 4-in-1 Supermicro systems too, but with AMD CPUs. They seemed a much better deal to us and have a few more DIMM slots (= 128GB per node in our ~2-3 year old boxes). The only issue was the almost-half-length PCIe slot; it's actually 6" instead of 6.6", so you have to be careful when buying PCIe cards (might not be an issue with the Intel-version board layouts).

We also invested in 10GbE infrastructure rather early (we're smaller than Etsy traffic-wise) and found it to be very useful for avoiding congestion issues when we need to transfer large files or pull data from a busy Postgres server.

As for HP servers, we avoid them since our DL360 G6 reported obscure fan failures when booting with an Intel 10GbE card, and we had to buy the (slower) HP NetXen-based cards instead. Supermicro boxes are not that picky, and you don't get the feeling that the vendor might be using dirty tricks to lock you in.


"Nagios checks to let us know when the filesystem becomes un-writeable" Any specific reason you don't just panic? I've had RO fs edge cases bite me a few times.

"In the future we may even run fiber to these machines" Anything against 10g copper? Per port prices are getting crazy cheap.

"DNS servers, LDAP servers, and our Hadoop Namenodes ... need RAID for extra data safety" Anything you've had issues with? Host level redundancy with the first two has been pretty great IME, can't comment on name odes. The only issue I've run in to is poor LDAP clients not fast failing an unhealthy server when another is available.


RO filesystems can be bad, but usually they're soft failures for us:

* Memcache can still work just fine
* DB servers stop responding (and the app handles that fairly gracefully)
* Web servers serve files from a RAM disk, so they keep working

No reason against 10G copper specifically - we haven't had to address the problem in detail yet. When we do, we might choose copper. Depends what happens when it happens :-)

We had a very nasty incident a few months ago, where the drive in one of the LDAP servers died. Well, it sort-of died. It started to time out a lot but didn't go fully offline. OpenLDAP kept running, but when you connected to it, the TCP connection would open and hang. This meant that all of our servers saw the server as "OK", but LDAP stopped working and caused all kinds of brilliant havoc :-)


Interesting. I was thinking failure to persist updates, inconsistent health check results, those types of nasty corner cases.

Sure. Premature optimization and all.

Huh. SSD from a major chip vendor perhaps? I had one of those bite me just a few weeks ago. In my case it would have been fine if the drive had offlined itself, but the intermittent I/O pauses killed performance without failing health checks. Actually, it's the same type of problem that RO FSs make me scared of.
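
(A common mitigation for that kind of half-dead failure is to make the health check exercise the real service with a hard timeout, rather than just opening a TCP connection; a sketch using ldapsearch and coreutils timeout, with the hostname and base DN made up:)

    # fail CRITICAL if a real (anonymous) LDAP query doesn't complete within 10s
    if timeout 10 ldapsearch -x -H ldap://ldap01.example.com \
        -b "dc=example,dc=com" -s base > /dev/null 2>&1; then
        echo "OK: LDAP answered a query"
        exit 0
    else
        echo "CRITICAL: LDAP query failed or hung"
        exit 2
    fi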


Anybody know what they use hadoop for?


Tons of stuff! Stay tuned for that in a future post :-) We posted about this in 2010: http://codeascraft.etsy.com/2010/02/24/analyzing-etsys-data-...

Probably time for a refresh!

- Avleen Vig, Staff Operations Engineer, Etsy


Cool! Thanks.


Analytics, search indexing, recommendations, various datasets that support search ranking, and some other things.


It's powered by love of crafting stuff :p


If I could hand-craft circuit boards and CPUs, I probably would :-)


And lots of Chinese mass production.



