Most of this is relatively straightforward and unsurprising. But the one part that grabbed me is about "jittering". They insert random delays into timed events (the example given is cache expiration) to prevent a thundering herd problem when all the parts of the distributed system see the event at the same time (and, for popular content, presumably repopulate the cache from the backend simultaneously).
This is simple enough when described, but it's not a technique I've seen applied much in practice or discussed in the community. I'm wondering if it's something that gets reinvented for every project that needs it, or if it's secret sauce known only at YouTube. Regardless, I thought it was pretty insightful.
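The whole trick is a couple of lines. A minimal sketch, assuming a memcache-style `cache` client; the `BASE_TTL` and `JITTER` values are my own made-up numbers, not YouTube's:

```python
import random

BASE_TTL = 3600  # nominal cache lifetime in seconds (assumed value)
JITTER = 300     # spread expirations across a 5-minute window (assumed)

def set_with_jitter(cache, key, value):
    # Each node picks a slightly different TTL, so entries for the same
    # content don't all expire (and refill from the backend) at the
    # same instant.
    ttl = BASE_TTL + random.randint(0, JITTER)
    cache.set(key, value, time=ttl)
```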
The Adblock Plus blog details the thundering herd problems they faced. Their ad blocking lists checked for updates every 5 days. Eventually, many users' update schedules would converge on Mondays because office computers did not run over the weekend. Updates that had been scheduled for Saturday or Sunday would spill over to Monday.
Windows does much the same thing w.r.t. policy refresh (which sucks down files from domain controllers) and update of "occasionally updated" metadata like last logon timestamp.
As an interesting counterpoint, Jeff Dean at Google says they prefer to take a known hit every once in a while, so they have all the cron jobs go off at the same time rather than introducing jitter. It perhaps has to do with their emphasis on reducing variability to control the size of the long tail of the latency distribution.
A nice extension of this is to use exponentially distributed delays. Recurring events then form Poisson processes, which are really easy to reason about when composed: e.g., if you have N servers firing off events at the same exponentially distributed rate, the arrival times are distributed the same as if you had one server firing events at N times the rate and distributing jobs uniformly at random to the other servers.
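In Python that's one call per delay. A sketch, where `RATE` and `handler` are placeholders of my own:

```python
import random
import time

RATE = 1 / 300.0  # average of one event every 5 minutes (assumed)

def fire_forever(handler):
    # Exponentially distributed gaps make the event stream a Poisson
    # process, so N servers at RATE compose into one process at N * RATE.
    while True:
        time.sleep(random.expovariate(RATE))
        handler()
```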
A counterexample is that the Linux kernel tries to schedule timer events for the same deadline time. That allows the processor to sleep longer because the kernel doesn't need to wake up as often just to handle 1 or 2 timer events.
It's kind of a weird way to describe it. The better way is that you want your caching to be probabilistic rather than deterministic. In general, you want to avoid anything that would get the nodes in a distributed system to harmonize.
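For example, expiry itself can be made probabilistic: instead of a hard deadline, each request rebuilds the entry with a probability that grows as the entry ages, so no single instant synchronizes the nodes. A rough sketch of that idea (my own illustration with a hypothetical `refresh` callback, not how any particular system does it):

```python
import random
import time

def get_probabilistic(cache, key, ttl, refresh, beta=1.0):
    entry = cache.get(key)
    if entry is not None:
        value, stored_at = entry
        age = time.time() - stored_at
        # Refresh probability rises linearly from 0 (fresh) to beta
        # (at TTL), so rebuilds smear out over the entry's lifetime.
        if random.random() > beta * age / ttl:
            return value
    value = refresh()
    cache.set(key, (value, time.time()))
    return value
```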
The other way to solve the problem though would be to handle that circumstance cleanly. There are ways to resolve a thundering herd without creating a scalability problem.
I've heard it called "splay" before, e.g. `chef-client` takes an interval option and a splay option; splay is a random amount of time added to the interval to prevent thundering herds.
In the past I've seen jitter used in all sorts of applications, from cron jobs to configuration management to memcache key expiration. Any time you have a crapload of nodes that all need to do an operation at a specific time, you rely on jitter to keep resources from bottlenecking. It's probably used anywhere the systems or network have only a minuscule amount of resource headroom (like cluster nodes that run at 93% utilization).
I hadn't thought about it in the general case, but I do frequently find myself adding a sleep for (PID mod some appropriate constant) to the beginning of big distributed batch jobs, in order to keep shared resources (NFS, database, etc.) from getting hammered all at once.
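That trick is about two lines; a sketch, with the 60-second window being an arbitrary assumption:

```python
import os
import time

# Stagger start-up: PID mod the window gives each worker a different
# offset with zero coordination between the jobs.
time.sleep(os.getpid() % 60)
# ...now hit NFS / the database / other shared resources...
```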
What's this? Python? Apache? MySQL? But I thought you had to be running beta-release key value stores, esoteric web servers, and experimental programming languages if you wanted to scale!
> Vitess - a new project released by YouTube, written in Go, it’s a frontend to MySQL. It does a lot of optimization on the fly, it rewrites queries and acts as a proxy. Currently it serves every YouTube database request. It’s RPC based.
Remote Procedure Call is a design paradigm for synchronous call-and-response network communication. The Sun RPC protocol is an actual technology defined in RFC1057: http://www.ietf.org/rfc/rfc1057.txt
It's not insane, though not terribly relevant in the modern world. The only common technology still using it is NFS.
If you look at the original RPC work by Bruce Nelson [1], it's pretty clear that there's no strict definition of it. I think most would argue that SOAP would be included, which is still pretty common.
When YouTube refers to Vitess as being RPC-based, they are not referring to Sun RPC (callrpc), but rather to the generic design pattern of exposing service calls over the network. In particular, Vitess makes services callable using either BSON or JSON serialisation over HTTP CONNECT calls.
I know the rage is async & nonblocking, but putting Python behind Apache is a good way to get predictable performance without concerns for calls that block, since they're mitigated via threads.
Also, it helps to have money since this approach requires more boxes. But as I said, it's very reliable.
The fastest, easiest, arguably most reliable way to scale is throwing money at it. And apparently it's easier to hire good people who know 20-year-old tech rather than 3-year-old tech.
The first 10 minutes are about monetization from one of the Youtube dev advocates. Skip to 9:45 to get to the "good stuff".
As an aside, this fellow is probably one of the best presenters I've seen from the pycon videos for this year. Confident, smooth, not reading from a computer screen or sheet of paper, clearly smart and in firm command of the subject matter.
I believe it could be faster still. Their Itoa cache is a map[int]string, when it could be a []string; indexing a slice avoids the map's hashing overhead.
Also, I suspect that a few more primitive type special cases in the first type switch in EncodeField could go a long way.
I fully agree with YouTube faking data. However, I reckon they are faking a bit too much: many times I would see 2,000 likes on a video with only 1,700 views (viral videos, that is).
I knew the view counter wasn't propagated but the likes were, and I was like: "Damn, this is YouTube, kinda disappointing..."
I guess if both were propagated at the same time I wouldn't mind.
I honestly don't understand why they don't simply use out-of-sync data. You could have nodes periodically send aggregates of likes & views, and then add those in to the total every N heartbeats. Why bother fudging the in-between?
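Something like this per node would do it; a sketch, with `flush` standing in for whatever hypothetical call ships deltas to the aggregator:

```python
import threading

class CounterShard:
    # Node-local like/view tallies, flushed on a heartbeat (a sketch).
    def __init__(self, flush, period=30):
        self.pending = {"likes": 0, "views": 0}
        self.lock = threading.Lock()
        self.flush = flush    # hypothetical: ships deltas to the aggregator
        self.period = period  # heartbeat interval in seconds (assumed)
        self._tick()

    def incr(self, kind):
        with self.lock:
            self.pending[kind] += 1

    def _tick(self):
        with self.lock:
            deltas, self.pending = self.pending, {"likes": 0, "views": 0}
        if any(deltas.values()):
            self.flush(deltas)  # the aggregator adds these into the total
        threading.Timer(self.period, self._tick).start()
```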
They're probably propagating the likes and views independently. Which still doesn't explain why they allow counter-intuitive gaps like that instead of fixing them up on the client-side in javascript.
Yeah, the view counter is a clear case where they've gone too far. People care about view counts and it seems like every other video I watch has people complaining in the comments about the view count being obviously wrong.
The only time I notice this is when the view counts freeze at 300-ish for a few hours. That's because they're checking if the views are legit or from a bot, IIRC.
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." -- Brian Kernighan
Complicated code that isn't easy to maintain isn't smart, it's just complicated. One thing people screw up a lot is implementing some design pattern in a way that ends up shotgunning configuration or business logic over a wide area instead of keeping it in one location and using DRY.
At 11:28 he says, "at last count, there was over a million lines of python running this thing"
Having never worked with code-bases larger than ~50kloc, I have a lot of trouble understanding what 1 million lines of code is needed for, especially considering that python is such a high-level language.
Does anyone have any idea why there would be this much code?
Oh, it wasn't meant to be snarky or a slam -- I'm honestly just curious what kinds of things require so much code. E.g., is it usually one or two components that dominate in codebases this size, or is it just a LOT of components, each tens of thousands of lines long? Do these kinds of counts usually include auto-generated code?
I love the part on faking data. I take the viewpoint that only software testers care that the comment count is exactly correct in the majority of systems. Users don't care.
I always thought it was a bug that view counts on YouTube varied depending on what page you saw them on. It can be different on the video view page, the channel page, and in your dashboard.
The publisher who uploaded the video might care. Especially so, if they are selling to advertisers. Comments, ratings, and favorites are three variables we calculate to reflect how engaging a video is.
If they feel strongly enough about it to leave a comment then I think it's safe to say it does hinder their enjoyment - in the same way that obviously broken things distract and displeasure in any medium.
My biggest gripe with YouTube: why are comments almost always repeated? Yeah, I realize that most YouTube comments are relatively worthless, but I do tend to speed through them to get a feel for the response to a particular video. Inevitably I get through 20 comments and then the same 20 are repeated over again, often several times. Perhaps they are trying to give the illusion of lots of comments, or assuming the comments don't matter. Personally I find it extremely annoying; I'd rather they block to load more than repeat.
> The number of videos has gone up 9 orders of magnitude and the number of developers has only gone up two orders of magnitude.
Two orders of magnitude means, at the very least, going from single digits to around 100 developers (say, 9 to 100), which is a huge increase, but it could mean way more. I wonder how big the team really is, and what the changing team dynamics are like at that scale and pace.
I am not. From what I have seen of YouTube, comments are vile, and mostly there are two strangers posting pointless arguments about something equally pointless.
For me, YouTube is good for watching videos. If I want to discuss it, I post it on FB.
Awesome technique. The fastest function call is the one that doesn't happen. When you have a monotonically increasing counter, like movie view counts or profile view counts, you could do a transaction on every update. Or you could do a transaction every once in a while, update by a random amount, and as long as it changes from odd to even people would probably believe it's real. Know how to fake data.
So all those people who buy views are kinda screwed now :-) I suspect this is a bad example. I HOPE this is a bad example, if only for the KONY2012 campaign :P
No no, the correct amount of views will be recorded for a specific video, it's just that each webserver doesn't know the exact number all the time. You make each webserver fetch the correct value perhaps every hour, and fake it in between. You'll get an ok approximation, users can't tell the difference, and you don't have to fetch the actual number every single pageview.
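In other words, something like this sketch, where `fetch_real` is a hypothetical database read and the refresh interval and fake increments are made-up numbers:

```python
import random
import time

class ApproxCounter:
    # Serves a plausible count between real reads; fakes the in-between.
    def __init__(self, fetch_real, refresh_every=3600):
        self.fetch_real = fetch_real        # hypothetical: one real DB read
        self.refresh_every = refresh_every  # seconds between real reads
        self.value = fetch_real()
        self.last_fetch = time.time()

    def display(self):
        now = time.time()
        if now - self.last_fetch > self.refresh_every:
            self.value = self.fetch_real()  # resync with the real total
            self.last_fetch = now
        else:
            # Nudge upward by a small random amount so the counter keeps
            # moving monotonically; users can't tell it isn't exact.
            self.value += random.randint(0, 3)
        return self.value
```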