How Facebook Ships Code

Silhouette · on July 5, 2012

Am I the only person who thinks this whole approach is broken?

We have seen the rise of "devops" recently, and big name web companies like Facebook and Google seem to be very proud of how engineer-led they are, how empowered their developers are, how their product managers don't have much real authority, how they push code to production ten minutes before it's even written, and so on.

From the outside, I see systems that are always changing, so users can never rely on anything working the same way from one visit to the next. I see organisations with access to sensitive personal data being cavalier at best about how they handle it. I see a major blunder every few weeks that at best causes serious irritation to users and at worst risks significant loss of business and/or legal/regulatory consequences.

And I see huge brands whose egotistical staff don't realise that they are successful despite evidently not bothering with either real product management or robust testing, not because of these things, and who don't seem to notice that they are still relying almost entirely on momentum from one or two really big successes from early on to maintain their user base and revenues.

jrockway · on July 5, 2012

I wouldn't lump Google in there with Facebook. We have very strict controls on access to user data (we can't even see email addresses in logs), and we have not adopted the motto "move fast and break things". We do extensive automated testing and have release processes that are designed to minimize problems in the event of a bad push. Yes, bugs happen from time to time, but it's software -- there is no known practical technique to produce bug-free software, so we have to settle for mostly-bug-free software instead. This isn't being amateurish or egotistical, it's being realistic.

Silhouette · on July 5, 2012

I'm not arguing for unrealistic quality levels, and I will acknowledge immediately that my experience could be atypical.

However, in fairness, if I look at all of the software that I use regularly in a professional capacity today, then it is clearly Google products that are the most buggy, and by a very wide margin.

For example, I have a client who uses Google Docs/Drive. We have rarely managed to hold a meeting without one of our small team struggling to see either a word processor document or a spreadsheet properly, and that's just the minor cosmetic or browser incompatibility bugs that keep appearing along with all those minor UI changes. We have also experienced some much more serious problems, including corruption/data loss and seeing the entire change history for some files become inaccessible for no apparent reason. In other words, it's not just minor UI errors that crept in as the product evolved, there are evidently fundamental flaws in the underlying architecture that don't store the data robustly.

Another example: I spent a couple of days last week trying to figure out why a site that had been working fine for users until recently and had not been changed at all on our side was suddenly generating bug reports. It turned out that recent Chrome builds have broken some HTML5 features in multiple ways. There have been related bug reports in some cases, but they have been closed as the specific example given no longer seemed to be a problem. Again, the nature of the problems makes it obvious that these are not just one-liner issues but fundamental flaws, typically where Chrome is so aggressive with its cache usage and/or triggering redrawing/relayout that it just plain doesn't work. And even though the bugs had been reported, obviously the correct root cause was never identified and fixed.

Another example: I recently spent some time looking into how the Closure Tools are coming along. Have you tried the examples/demonstrations for the Closure Library recently? Many of them simply don't work in Gecko-based browsers, and that would be obvious if anyone working on the project had made even a cursory attempt to test them for five minutes.

I will finish by once again acknowledging that my experiences here may be atypical, and that the particular projects I've mentioned that I happen to be using may not reflect the wider Google culture. But on the evidence before me, I see an organisation that keeps breaking things in its rush to push new features out and that demonstrably lacks robust architectures that keep data safe, effective testing processes before pushing code into production, and proper issue resolution processes when defects do get reported.

wpietri · on July 5, 2012

I saw one interesting analysis of service bugs. I can't find it now, but the basic point was that people should stop focusing on MTBF (mean time between failures) and instead look at MTTR (mean time to recovery).

The notion was that it was much better to be able to notice and fix bugs quickly than it was to add process delays that reduced bug frequency. That especially makes sense to me when you think about the value integral. The longer your release cycle, the longer you keep people away from improvements. It also makes sense if you think about companies as learning organizations; speed of learning is limited by the length of your feedback loops.
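A back-of-the-envelope sketch of that trade-off, assuming the usual steady-state formula availability = MTBF / (MTBF + MTTR). The function name and the figures are hypothetical, picked only to show how faster recovery can outweigh more frequent failures:

```typescript
// Steady-state availability from mean time between failures (MTBF)
// and mean time to recovery (MTTR). Both must use the same unit (hours here).
function availability(mtbfHours: number, mttrHours: number): number {
  return mtbfHours / (mtbfHours + mttrHours);
}

// Hypothetical numbers, for illustration only.
const slowAndCareful = availability(1000, 10); // fails rarely, takes 10 hours to recover
const fastAndLoose = availability(100, 0.5);   // fails 10x as often, recovers in 30 minutes

console.log(slowAndCareful.toFixed(4)); // 0.9901
console.log(fastAndLoose.toFixed(4));   // 0.9950
```

With these made-up numbers, the service that breaks ten times as often is still up more of the time, because each outage is twenty times shorter.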

So I'd say that Google's mistake isn't going for a release-early, release often approach. It's that they don't pay sufficient attention after release to user feedback or user metrics. Hell, I've stopped reporting bugs to Google, and I must know 10 people who work there. Even with a good backchannel I just have no faith that anything would come of a report.

jrockway · on July 6, 2012

> However, in fairness, if I look at all of the software that I use regularly in a professional capacity today, then it is clearly Google products that are the most buggy, and by a very wide margin.

It's more likely that you just spend more time using Google software than other software. When I worked at BofA, our website was down for an entire week because of a bad push. No online banking for a week. That's pretty much the standard for the industry. I don't doubt that there are bugs in Docs or Chrome, but they are relatively obscure. That's the nature of software, not every bug is caught by a unit test and users end up seeing them and reporting them. (Oddly, we use Docs heavily at Google, and the only bug that I've noticed is the "Your zoom level is not supported" one. I haven't personally hit any other issues.)

But like I said in my original post, if you know how to develop bug-free software, I'd love to hear how. Expecting low-cost web apps to be as reliable as airplane control systems is unrealistic.

Silhouette · on July 6, 2012

> It's more likely that you just spend more time using Google software than other software.

Sorry, but I'm really not. In the case of Chrome, for example, I would routinely test new work on web projects in all the major browsers. I am writing this with some empirical data in front of me, because I've just checked the bug tracking systems for a couple of projects I work on to be sure: for both projects, excluding mobile browsers, Chrome has been responsible for a clear majority of all defects where the root cause was found to be a browser bug over the past two years.

> I don't doubt that there are bugs in Docs or Chrome, but they are relatively obscure.

Respectfully, if they were that obscure, my colleagues and I wouldn't keep running into them on multiple projects. I'd agree that the particular symptoms of any particular bug are usually a corner case: set this option to A and that option to B and it breaks, but other combinations work OK. It's the way that several of these bugs have recurring themes that betray both an underlying architectural weakness and a lack of effective quality control that I find most unfortunate.

Using Chrome as an example again, it is clearly aggressive with caching and conservative with repainting, but sometimes that means it is simply not behaving properly at all. If I set a part of the DOM to display instead of being hidden and then send an AJAX request, I want my "please wait" message displayed while the request is running, not afterwards, or indeed not at all since it probably gets hidden again as soon as the response arrives.

> That's the nature of software, not every bug is caught by a unit test and users end up seeing them and reporting them.

And as I mentioned, several of those bugs in Chrome had been reported, and subsequently closed without the root cause ever being identified and fixed.

> Expecting low-cost web apps to be as reliable as airplane control systems is unrealistic.

I don't expect Google's software to be as reliable as airplane control systems, but somewhere close to as reliable as everyone else's software would be nice. I appreciate that you're having trouble believing it or reconciling it with your own experience, and I've already acknowledged that my experience might be a complete outlier, but I'm looking at several years of empirical data across multiple completely different projects and development teams and it is quite clear that the Google products I'm looking at here haven't been keeping up lately.

jrockway · on July 6, 2012

> And as I mentioned, several of those bugs in Chrome had been reported, and subsequently closed without the root cause ever being identified and fixed.

Link?

Silhouette · on July 6, 2012

See issue 104487 for one recent example. An issue was flagged up where HTML5 videos weren't playing properly when given a poster image but no controls attribute.

The issue was closed almost immediately, with some obviously hastily written comments, apparently because no-one could reproduce it in a different version of Chrome on Linux or Mac. As far as I can see, no-one even tried to reproduce the other reported failing case on Windows, and there was no attempt at all to investigate the original bug and determine how it happened and why it was no longer observable on the platforms tested.

The issue was simply marked "fixed", despite no actual fix having been identified, rather than giving it a more specific "no longer reproducible" status.

There are still serious problems with that combination of attributes today.

wilfra · on July 5, 2012

Where HTML5 is concerned, Chrome is by far the most buggy browser. It's unfortunate because over 50% of our users are using it now.

Silhouette · on July 5, 2012

> Where HTML5 is concerned, Chrome is by far the most buggy browser.

I agree, though FWIW I'd say it still takes more work to support Safari on iOS because of all the odd quirks. Apple might not consider them bugs. They're entitled to their opinion. ;-)

It surprises me that we've reached this point, but several of the web-based projects for business users that I work on professionally now recommend IE9 and won't officially support either Firefox or Chrome, and I agree with them. The trendy culture of frequently pushing subtle changes and half-baked new features and then hoping that if anything breaks you can fix it fairly quickly might work for places like Facebook, but it just doesn't cut it for professionals.

chris_wot · on July 5, 2012

I'm interested in that... What breakages stop folks from using Firefox?

Silhouette · on July 5, 2012

The most recent major one I know was the LTS release, in which they seriously broke embedded content like Flash and Java applets.

The issue was known and in the bug tracker several days before the release, yet they went ahead and released anyway. Consequently many sites and browser-hosted user interfaces across the world were broken.

It was then a very long time before another patch was pushed to fix the regression.

I'll leave it to you to decide whether it's worse that such a serious bug could get into the repo in the first place, that they still pushed the update to everyone even though they knew about the problem, or that they allowed sites that rely on these technologies to be broken for so long before correcting their mistake.

batgaijin · on July 5, 2012

Hey man, don't anger the hive.

dexen · on July 5, 2012

> Am I the only person who thinks this whole approach is broken?

Broken? Depends on your perspective.

If you care first and foremost about your users' data, it sure is broken. But FB is not a hosting company.

If you care most about the correctness of your code and getting everything right the first time (VMS style rather than UNIX `worse is better' style), then again, it will seem broken. But FB is not an academic exercise.

If, on the other hand, you felt like a stakeholder in the company and cared most about developing the business, then no, `moving fast and breaking things' is a means to that end.

If, alternatively, you enjoy experimentation and pushing frontiers, then again, no, `moving fast and breaking things' is about the only modus operandi that can accommodate that.

Emphasis on product management and testing is for serving your customers -- but users aren't FB's customers. I cannot assure you, but I would bet a lot FB does indeed heavily test and manage projects related to serving ads -- which is what their customers pay for.

jacques_chester · on July 5, 2012

I think Facebook is successful in spite of their engineering culture, not because of it. They started with the densest social graph of any social network and that made the difference in the long run.

wpietri · on July 5, 2012

"Broken" seems strong. I'd agree with "has flaws", but every system has flaws.

If I were Zuckerberg, I'd be terrified of slipping into a premature complacency or a heavyweight process that prevents change. After years of social network king-of-the-hill, Facebook seems to be set up for the generational dominance of a Microsoft or a Google. But it's not locked in yet.

Sure, current practices risk irritating users or a severe FTC beatdown. But those are obvious risks, easy to measure and mitigate. The real harm to Microsoft didn't come from annoying UI or regulatory intervention. It came from long-term shifts that they were too culturally isolated and too inflexible to understand or adapt to, much less drive.

There are certainly downsides to shipping early and often and then seeing what happens to your user metrics. But bottom-up power to ship means that a lot more good, novel ideas will see the light of day.

chris_wot · on July 5, 2012

> If I were Zuckerberg, I'd be terrified of slipping into a premature complacency or a heavyweight process that prevents change.

He just listed Facebook on NASDAQ - can't think of a single business decision that would do more to cause this result than that!

priv_acy · on July 5, 2012

To be fair, he really had little choice.

SatvikBeri · on July 5, 2012

Exactly. Companies are legally required to go public if they have >500 shareholders, which Facebook has had for a while due to investors + stock options.

accountoftheday · on July 5, 2012

Wrong. You can stay private with >500 shareholders, you just have reporting requirements matching those of a public company.

SatvikBeri · on July 6, 2012

You are correct. I should have been more precise: once a company has >500 shareholders, they have the same reporting requirements as a public company. Since reporting is one of the biggest disadvantages to going public, this means that companies with >500 shareholders are very strongly incentivized to go public.

snowwrestler · on July 5, 2012

It's the reporting requirements that create the risk of stifling bureaucracy.

By listing on NASDAQ, Facebook at least gains some capital in return for this reporting.

omegant · on July 5, 2012

As a simple user, I sometimes find all the continuous tweaking and changing of stuff in Facebook annoying. Maybe there is a reason for it in user behaviour and ad conversions, but if there isn't, please don't make all those useless changes just for fun! The last major update, Timeline, is especially dull; I find it hard to navigate with that split view. Of course I am only a user; surely the cohort stats show how all the millions of people rave about each and every change... In my opinion it is far more important to fix bugs like the one in the iPad app. When you visit the comments on a picture you have previously commented on, Facebook will open another random picture or comment page of the user who posted the picture. It's the same on the older iPhone app (I have the 3G and I am not updating it, so it's my own self-inflicted bug), but I just downloaded the iPad app two weeks ago, and there it is, safe and sound, the same ol' bug. Sorry for the final rant, but maybe they should have some department curating the user experience beyond the engineer-centered culture. Maybe the Google way is better: work on established problems, and then give people some time to work on side projects and give them the opportunity to grow.

lmm · on July 5, 2012

>And I see huge brands whose egotistical staff don't realise that they are successful despite evidently not bothering with either real product management or robust testing, not because of these things

Citation needed. The massive public success of these companies would argue against you, as would my personal experience working with and without traditional project management.

Silhouette · on July 5, 2012

> The massive public success of these companies would argue against you

What massive public success would that be?

Let's ignore the underlying platforms that originally made them: search, ads, and arguably GMail for Google; and reaching critical mass on the social graph and ads for Facebook. Those were developed long before the developer culture we're discussing here. Since then, what massive successes has all this developer-led initiative produced?

Google has made or bought a string of flops, most of which have been shut down one after another. It's doing OK with Chrome and Android, but neither of those is exactly a shining example of high quality software development.

Facebook has... Well, what? Repeated changes to the UI that seem to irritate more people every time(line), repeated privacy lapses, and repeated backtracks in the face of widespread criticism from their user base?

Neither business seems to have any sense of direction. Google seems to be trying to consolidate under Page by aggressively cutting out anything that's not a big ticket item in the portfolio, which might well be a good move, but it remains to be seen how well they can innovate using anything other than the what-sticks-to-the-wall strategy that they've adopted in recent years. Facebook just seems to be trying to find more ways to exploit information their users have explicitly decided to make private and get away with it.

lmm · on July 6, 2012

>Let's ignore the underlying platforms that originally made them: search, ads, and arguably GMail for Google; and reaching critical mass on the social graph and ads for Facebook. Those were developed long before the developer culture we're discussing here.

Huh? Do you think they started out as straight-laced businesses with traditional project managers, and then arbitrarily decided to get rid of them after they became successful? Their cultures were always like this; did you see the recent story here on how Facebook was in its 30-person days?

Silhouette · on July 6, 2012

> Do you think they started out as straight-laced businesses with traditional project managers, and then arbitrarily decided to get rid of them after they became successful?

No, I think they each started out with one good idea that they executed well enough and I think each also had the good fortune to be in the right place at the right time. They made great successes out of their respective founders' big ideas and their first funding rounds, but that was nearly a decade ago in Facebook's case and well over that for Google.

Today, these organisations have been considered among the most desirable employers for many of the best and brightest software developers of a generation. They have more money than they know what to do with because of those original successes, and they have thousands of very smart people working for them. No-one can credibly claim that they don't have vast and talented software development teams.

And yet, neither has produced an industry-shaking development in several years, never mind creating any new markets, and their original successes continue to bring in by far the lion's share of their revenues. I don't see how you can blame that on anything other than a lack of leadership and vision from the other parts of the organisation. The people who are supposed to be guiding and nurturing and co-ordinating just don't seem to be there, and it seems they're still trying to operate like that 30-person organisation, just scaled up by an order of magnitude or three.

sreyaNotfilc · on July 5, 2012

This may be my favorite quote...

"Engineers handle entire feature themselves — front end javascript, backend database code, and everything in between. If they want help from a Designer (there are a limited staff of dedicated designers available), they need to get a Designer interested enough in their project to take it on. Same for Architect help. But in general, expectation is that engineers will handle everything they need themselves."

I actually like this idea. Building my own website as well as working as the senior engineer during my day job forces me to be involved in all facets of web development. The jobs are not abstracted. You are expected to know what you're doing from the front end to the back end. If not, nothing gets done in a timely manner. I like this method because the front end is tightly woven in with the business logic of the module that you're working on. In other words, you know what the code is doing inside and out.

crazygringo · on July 5, 2012

I dunno. I personally am a full-stack guy.

But when I work with back-end developers who don't know MySQL inside and out, and they write a query that works fast in development but slow in production because they didn't realize the column index they specified won't work because of a string collation incompatibility between tables, and they've never even heard of this kind of problem before...

Then I wish they'd stick to writing their back-end executable code, and have a database guy write their query for them, and have a JavaScript expert take care of the front-end stuff.

It's not a question of intelligence, just a question of experience. You have to have done a lot of JavaScript coding to realize never to call parseInt(x), but always parseInt(x, 10), for example. And the number of CSS incompatibilities between browsers that you need to account for...
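A small sketch of why the radix matters. Without one, parseInt guesses the base from the string's prefix, and engines predating ECMAScript 5 could also treat a leading zero as octal, so parseInt("08") could come back as 0. The variable names are only for illustration:

```typescript
// Without a radix, parseInt infers the base from the string's prefix.
const inferred = parseInt("0x1f");      // "0x" prefix is read as hexadecimal: 31
const explicit = parseInt("0x1f", 10);  // base 10 stops parsing at the "x": 0
const leadingZero = parseInt("08", 10); // an explicit radix always gives 8

console.log(inferred, explicit, leadingZero); // 31 0 8
```

Passing the radix every time means the result no longer depends on what the input happens to look like or on which engine runs it.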

Unless someone has a lot of full-stack experience, it doesn't seem that great that they should be pushing out full-stack code on their own at a place like Facebook. Maybe code reviews will catch those kinds of things, though.

underwater · on July 5, 2012

Facebook has good abstractions at all levels of the stack. This allows engineers to work on all parts of the stack without having to worry about low level implementation details. I have not written a line of SQL while at Facebook.

If there is some domain knowledge needed then I can just rope in an engineer from the appropriate team to review my diff.

It's also worth noting that on product teams engineers tend to gravitate towards the parts of development they are more comfortable with, or enjoy the most. They are not forced to work with the full stack just for the sake of it.

sreyaNotfilc · on July 5, 2012

My point is that sometimes you don't have the luxury of having x amount of DB people and x amount of js folks and x amount of server side guys and so on and so forth. I'm coming from a place with very low resources (I'm the only developer at my job right now, we used to have 4).

(I wrote this stuff before but for some reason it just didn't go through)

As far as testing and building things go, I believe it is best to have an engineer that is able to fulfill those roles for the following reasons... 1) They understand the structures and what it is that makes an application do what it does. 2) It's a very good learning tool. If you mess up you know exactly why. You'll be able to constantly adapt yourself to become better. 3) Waiting for your "role" to be utilized within an organization may sound great because you have free time, but it causes stagnation and "Diva-ism". That is the whole "It's not my job, so I don't care about your problem" or "You need me more than I need you" attitude. 4) You can move relatively quickly since the code and logic is coming directly from you.

Yes, specialists are very important. For large organizations they are very much needed. But, to get things to work right now, you need to be flexible and open-minded. Look at Facebook's UI, it's not the prettiest (blue bar at the top and 3 columns), yet it's an extremely popular product. Most users just want the damn thing to work.

I'm no web designer, but I understand that a clean UI is needed. I'm not a DB guy, but I can figure out most problems. Etc. etc.

I think that's what Zuck actually wants in his engineers. Guys and gals that can see a problem and fix it on their own. Guys and gals that can think of a cool idea, test it, and have it released for that 1%.

Code reviews and peer help will smooth things out. I believe it was Jamie Zawinski's interview in the book "Coders at Work" that had the same mantra when he developed Netscape. Just get the thing out fast. You'll get more eyes on it. And you can always make it better the next time around.

It seems like a lot of shops do that. It's a cutthroat world and if you spend too much time trying to be super clean and precise, you'll still be on version 1 while your competitor is on version 10.

polyfractal · on July 5, 2012

Eh, I dunno. I don't want the carpenter designing my house, and I don't want the architect cutting the main support beams. I want both of them doing what they do best, and working together to do it.

wpietri · on July 5, 2012

I think that's a false dichotomy. The good carpenters I know are all reasonably good designers. And I would never hire an architect who didn't have some experience actually making things; otherwise there's too much risk of getting what software people call "architecture astronauts".

I think the right solution is the approach IDEO takes: teams of what they call T-shaped people: broad general understanding, with great depth in a particular area.

In Facebook's case in particular, I don't think they need a lot of brilliant design as much as they need solid, competent, well-integrated features. Indeed, their latest attempt at design brilliance, Timeline, has been much less well executed than their zillion engineer-led evolutionary changes.

jbl · on July 5, 2012

The IDEO approach is something that I like and look for in the people I work with. When I was teaching in grad school, I'd often exhort my students to think at least one level 'above' and one level 'below' whatever phenomena we were studying that week.

With software, I think it's important to see the forest for the trees, so that you can make the best tree for the forest... (apologies for the tortured analogy)

alttab · on July 5, 2012

I agree. Specialists are really just full stack engineers who haven't fulfilled their potential. Why hire a self-proclaimed UX guy? Unless I have all the amazing engineers I need for my team (99% not the case), I'm looking for someone who can do multiple things. The best ones are ones that can do schema design, query design, system and model architecture, front-end wireframes, javascript, and implementation.

That said, your comment about the front end being "tightly woven in with the business logic of the module" scares me a little. The front end is presentation and interaction. Unless most of your business's mojo is client side, even a full stack engineer should design each layer as fully cohesive and encapsulated components. I may have misunderstood you, however.

sreyaNotfilc · on July 5, 2012

What I mean by "tightly woven" is that you know exactly what values/components/types need to communicate with each other whether it's the front end or the back end. You're not wasting time trying to match a spec sheet or fulfill a precondition to an existing back-end procedure (cause honestly those things just don't fit right in). That is, you're not spending time trying to put a square peg into a round hole.

alttab · on July 5, 2012

Ah I getcha now. I think my point is still valid though - and in fact full stack engineers are more likely to write a spaghetti code mess from the back to the front because they control the entire feature and may see the code in its entirety as one "feature" and thus one code base.

sreyaNotfilc · on July 5, 2012

Your point is valid. I wrote a whole post about this but the reply didn't seem to work so I lost the full reply.

I did write most of my ideas to "underwater"'s comment.

Basically, my position is more on the fact that if you need things done quickly, then a full stack guy is the one you need. You can fix/abstract the code later. In fact, I encourage that. But to get the ball rolling, you're gonna need some quick and useful (sometimes ugly spaghetti) code.

It's a trade-off that must be made, especially when you're rapidly building your product.

gouranga · on July 5, 2012

What a crock of shit:

> after boot camp, all engineers get access to live DB

I can understand it at a startup or small org, but at an organisation of that size there should be very tight access control.

Despite what anyone says, the probability that someone does something bad increases in larger groups. Security should be on a simple need-to-know basis and nothing else.

I build BIG financial software and we have certain audit requirements, access control requirements, data protection requirements etc and that is exactly how it should be.

I'd never put my data near FB. They are simply irresponsible.

piggity · on July 5, 2012

You finance and billing guys will never understand.

These are rockstars.

They work for facebook.

They would never ever type UPDATE users SET email = username || '@facebook.com'; WHERE username == 'john.smith';

jacques_chester · on July 5, 2012

Of course not. Much faster to type

    UPDATE users SET email = username || '@facebook.com';

Then let the users sort out the exceptions.

alttab · on July 5, 2012

I think he was being sarcastic - you can see the semicolon in the middle of the command which looks like pretty much what they did with the whole e-mail debacle. You simply removed the second incomplete statement.
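To make the semicolon point concrete, here is a rough sketch of how a client that runs one statement at a time might cut that line up. splitStatements is a hypothetical, deliberately naive helper (it only skips semicolons inside single-quoted strings), not how any particular database client works:

```typescript
// Naive splitter: cuts on semicolons that sit outside single-quoted strings.
// A sketch only; real SQL clients also handle comments, escapes and dialect quirks.
function splitStatements(sql: string): string[] {
  const statements: string[] = [];
  let current = "";
  let inQuote = false;
  for (let i = 0; i < sql.length; i++) {
    const ch = sql.charAt(i);
    if (ch === "'") inQuote = !inQuote;
    if (ch === ";" && !inQuote) {
      if (current.trim() !== "") statements.push(current.trim());
      current = "";
    } else {
      current += ch;
    }
  }
  if (current.trim() !== "") statements.push(current.trim());
  return statements;
}

const typed =
  "UPDATE users SET email = username || '@facebook.com'; WHERE username == 'john.smith';";
const [first, second] = splitStatements(typed);

console.log(first);  // UPDATE users SET email = username || '@facebook.com'
console.log(second); // WHERE username == 'john.smith'
```

The first piece is a complete UPDATE with no WHERE clause, so it would touch every row. The second piece is not a valid statement on its own, so the intended filter would never apply.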

jacques_chester · on July 5, 2012

I now realise that I am underqualified to work at Facebook. :(

getsat · on July 5, 2012

>implying this was an engineering team decision

gaius · on July 5, 2012

Right. FB doesn't get that all they have is the user experience, and it is bad user experience if I drop out of a conversation I'm having because their DBA-free database has decided to simply, silently drop my last comment. Or if it takes me several attempts to post a photo. Or if I invite someone to an event, and they never get the invite.

You can do all sorts of amazing things if you just don't care if your code actually works or not. Grown-up companies have full-time professional DBAs, and not just for separation of concerns.

darkarmani · on July 5, 2012

> Grown-up companies have full-time professional DBAs

I don't know if I'd call them "professional," but they definitely have full-time DBAs. ;)

mkjones · on July 6, 2012

What do you mean? I've worked with some of our DBAs, and they're quite good. In fact, I can't think of a single site issue that was caused by a DBA. I work on fighting spam at FB, and we make use of MySQL quite a bit.

jorgeleo · on July 5, 2012

But Facebook is not a financial institution, and I would put the responsibility on the user if financial data ends up on the news feed.

A one-size-fits-all mentality (for engineering practices) is not the correct approach either.

gouranga · on July 5, 2012

You are correct - it is not a financial organisation.

It does know who you are, what you eat, where you've been, who you're friends with, what you're interested in, what your political allegiances are, what your bowel movements are like etc.

A list of investments or your mortgage statement is way less important.

jdsemrau · on July 5, 2012

>I build BIG financial software and we have certain audit requirements, access control requirements, data protection requirements etc and that is exactly how it should be.

Funny, I had the same argument with a colleague today. Since FB is listed and their business model is based on customer data they should follow the same legal requirements as customer finance companies.

yashchandra · on July 5, 2012

"they should follow the same legal requirements"

Perhaps. But they certainly do not face financial audit requirements and things like Dodd-Frank, which are overwhelming all the major banks right now and are a big money-earner for consultants who know the ABCs of audit/compliance in 2012. Anyway, the point is that even though FB is not ideal in how it stores/uses user data (which I personally am not a fan of either), it does not matter as much as it does for a bank/financial institution. I say this while working for clients that are major banks.

jschuur · on July 5, 2012

Note that it doesn't say whether they all have write access, or whether all the data they have access to is encrypted in some way.

And any of those statements should be interpreted as one response he got from talking to lots of different people, and be subject to some amount of skepticism.

gouranga · on July 5, 2012

Either way, it simply doesn't matter.

TazeTSchnitzel · on July 5, 2012

It's a little irresponsible, but how else do you expect them to debug live code?

arethuza · on July 5, 2012

"how else do you expect them to debug live code"

I've actually spent quite a long time writing code where, for various legal reasons, you are never going to get direct access to the production data. (And indeed some cases where accessing the data in question would be a serious criminal offence).

Makes you really keen on defensive programming and comprehensive logging - although you do then have to be careful about what you log.

I actually found it kind of fun to try to debug complex systems based only on a log file and QA with production operations folks.

[NB Of course, none of this would be applicable to Facebook]
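As a rough illustration of "careful about what you log", the sketch below masks anything that looks like an email address before a log line is written. It uses Python's standard logging module; the logger name, the regular expression, and the "<email>" placeholder are all assumptions made for the example, not anything described in this thread.

```python
import io
import logging
import re

# Deliberately loose pattern: it only needs to catch things that look like emails.
EMAIL_RE = re.compile(r"[^\s@]+@[^\s@]+\.[^\s@]+")


class RedactEmails(logging.Filter):
    """Replace anything that looks like an email address before it is written."""

    def filter(self, record):
        # Format the message first so values passed as args are redacted too.
        record.msg = EMAIL_RE.sub("<email>", record.getMessage())
        record.args = ()
        return True  # keep the record, just with the sensitive part masked


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(RedactEmails())

logger = logging.getLogger("nightly_report")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

logger.info("sending report to %s (user id %d)", "john.smith@example.com", 42)
logged = stream.getvalue()
print(logged)
```

Attaching the filter to the handler rather than to individual call sites means a developer who forgets about the rule still gets the masking, which is roughly the point of logging defensively when nobody is allowed to look at production data.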

gouranga · on July 5, 2012

Thou who uses unit tests, UI tests, precondition, postcondition checks, heavy logging, knows the language and toolchain and knows arse from elbow NEVER debugs in production out of necessity or desire.

It just works.

In the last 5 years, we've never hooked a debugger to a production instance or given a developer access to one.

In that time, we've pushed approximately 912 billion SQL transactions through the system and 24 billion page hits (most of our stuff is backend non OLTP).

That's how it's done.

rimantas · on July 5, 2012

No, please tell me: how did you learn to write unit tests which cover all the crazy data configurations users are able to come up with?

gioele · on July 5, 2012

Boundary value analysis? State transition testing? Equivalence partitioning?

There are heaps of techniques one can use to identify which data points and combinations are useful to test. Some of these techniques even take a peek at your code to highlight the possible pain points.
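For anyone who has not met two of those terms, here is a small sketch of equivalence partitioning and boundary value analysis in Python. The username length rule (3 to 20 characters) and the function name are invented purely for the example.

```python
MIN_LEN, MAX_LEN = 3, 20  # hypothetical rule for illustration


def is_valid_username(name: str) -> bool:
    """Accept names of MIN_LEN..MAX_LEN characters, inclusive."""
    return MIN_LEN <= len(name) <= MAX_LEN


# Equivalence partitioning: one representative per class of inputs that
# the code should treat the same way.
partitions = {
    "too short": ("ab", False),
    "valid": ("alice_1984", True),
    "too long": ("x" * 40, False),
}

# Boundary value analysis: the values on either side of each edge.
boundaries = {
    MIN_LEN - 1: False,
    MIN_LEN: True,
    MAX_LEN: True,
    MAX_LEN + 1: False,
}

failures = []
for label, (name, expected) in partitions.items():
    if is_valid_username(name) != expected:
        failures.append(label)
for length, expected in boundaries.items():
    if is_valid_username("a" * length) != expected:
        failures.append(f"length {length}")

print("failures:", failures)
```

Seven inputs stand in for every possible string length here, which is the general idea: pick test data by reasoning about where behaviour changes rather than by guessing what users might type.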

gaius · on July 5, 2012

LOL! You think someone who's drunk the unit testing kool-aid has even heard of any of that?

wpietri · on July 5, 2012

You have to start somewhere. If I have to choose between a world where people never test and one where they test but aren't very good at it yet, I'll definitely take the latter.

Of course, I'm still filled with rage when I find a cargo-cult suite of unit tests. But it's so much easier to convince those people to write smarter tests than it is to deal with the team that has no time to write tests because they're spending all their time debugging.

alttab · on July 5, 2012

By being able to design software that doesn't need or allow crazy data configurations.

My whole mantra is don't manage complexity: avoid it.

gouranga · on July 5, 2012

You nailed it - thanks for the absolutely spot on comment.

Design is the key word here. It requires thought, it requires intelligence and it requires multiple people's input.

TazeTSchnitzel · on July 5, 2012

You're right.

Now that I think about it, I doubt Facebook is that careful, considering how often they break things.

bnr · on July 5, 2012

You don't. Reproduce the issue on a development instance.

damncabbage · on July 5, 2012

I first took TazeTSchnitzel's comment to be sarcasm, but now I'm not so sure.

(Poe's Law says hi.)

TazeTSchnitzel · on July 5, 2012

Not sarcasm. Of course you should reproduce in a development environment, but for some issues, looking at the live DB is the only way to see what is wrong.

mgkimsal · on July 5, 2012

A system I work on has grown a lot over the last few years (data for 2000 users is now data for 50,000, rules have been added, etc).

Example story: We had an issue that was only coming up on production. I could not reproduce it in a dev environment. Worse, we didn't even notice it for a long time because it was a nightly job, and people were not reporting an absence of their notifications. (nightly job to email reports).

Finally got a dump of live data to dev system (it's a lot of data, so I don't pull it all the time). Someone had updated their email address to something invalid, and the system threw an uncaught exception in the middle of the loop (yay java). So... half the people were getting their stuff, the second half didn't - guilty record was a user with last name starting with L.

Yes, we should have prevented a bad email from going in with validation, but it slipped my testing (and the client's). It's just shocking to me today that professional people whose job it is to send and receive email can mistype their email. "john smith @yaho" is not valid, yes I should catch that, but someone typed that in. Adding on top of that was my own dynamic language background not mapping well to the JVM - one bad address in the middle of a loop doesn't just get skipped and logged, but the entire process now stops.

Multiple lessons learned from that one certainly (logging, validation, exception handling, etc) but... it would have taken me a lot longer to even consider putting in an invalid email address (it worked in dev - it was working for end users, etc) - pulling live data was the only thing that made it apparent.
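For illustration, the "skip and log the bad record" behaviour could be sketched like this. It is written in Python rather than the Groovy/Java of the actual system, and the mailer stub, job name, and address check are all hypothetical.

```python
import logging
import re

log = logging.getLogger("nightly_reports")  # hypothetical job name

# Very rough shape check, not full address validation.
EMAIL_SHAPE = re.compile(r"^[^\s@]+@[^\s@]+\.[^\s@]+$")


def send_report(address: str) -> None:
    """Stand-in for the real mailer; raises on addresses it cannot use."""
    if not EMAIL_SHAPE.match(address):
        raise ValueError(f"invalid address: {address!r}")


def run_nightly_job(users):
    """Send to every user, recording failures instead of stopping at the first."""
    sent, failed = [], []
    for name, address in users:
        try:
            send_report(address)
            sent.append(name)
        except ValueError:
            # Log with the traceback and carry on with the next user.
            log.exception("could not send report to user %s", name)
            failed.append(name)
    return sent, failed


users = [
    ("Kim", "kim@example.com"),
    ("Lee", "john smith @yaho"),  # the kind of value described above
    ("Moss", "moss@example.com"),
]
sent, failed = run_nightly_job(users)
print(sent, failed)
```

The try/except sits inside the loop, so one bad record costs one report instead of everyone from L onwards, and the returned failure list gives the job something to alert on.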

wpietri · on July 5, 2012

I agree totally with your broad point, but wanted to respond to the "I can't believe a user typed..."

Even if your 50,000 people are all pretty smart, you're well out into the range of exceptional circumstances. If I do something once a day my whole life, that's still only 25,000 times. It's pretty easy to imagine somebody on their worst day doing something like that. E.g., You go to visit family, so you're jet lagged. You had a couple of late beers with your brother, so you're hung over. His baby is screaming with colic and the toddler is banging pots together. You're VPNing in from their kitchen table, trying to fill out some form, and in the middle somebody knocks over a glass of juice. You yank your laptop up, help clean up the mess, and then sit back down to finish. While the pot-banging proceeds at full volume.

Madness, sure, but common madness.

tedunangst · on July 5, 2012

> Adding on top of that was my own dynamic language background not mapping well to the JVM - one bad address in the middle of a loop doesn't just get skipped and logged, but the entire process now stops.

Makes me wonder what dynamic language you're used to. An uncaught exception in both Python and Ruby aborts the program, not just the tiny inner loop that happens to be running.

mgkimsal · on July 5, 2012

Primarily PHP, although I seem to remember classic ASP VBScript not dying if you passed a bad email to CDO (but that's been a long time, so I might not remember that correctly).

PHP would not throw an exception if the mail function got a bad email address. Well, some mailing libraries might, but the core mailing function (basically just a wrapper for sendmail/postfix CLI stuff) wouldn't.

tedunangst · on July 5, 2012

Oh, you expected silent failure from a crappy library. :) I think Java (or rather, the library you're now using) made a much better decision to not blunder along.

damncabbage · on July 6, 2012

"on error resume next" redux?

gouranga · on July 5, 2012

> Adding on top of that was my own dynamic language background not mapping well to the JVM

There's your problem...

mgkimsal · on July 5, 2012

It's a problem, but it certainly wasn't my only one - Groovy, to be precise.

rimantas · on July 5, 2012

How do you do that, if the issue is caused by some corner case with user data that you did not foresee and hence did not cover?

Goladus · on July 5, 2012

For some applications, the number of corner cases that can not be resolved by reading log files is minimal. (you are writing logs, right?) That is, too, assuming the problem is actually in the code itself and not the infrastructure.

epriest · on July 5, 2012

See this question on Quora for some clarifications from people who are or have been Facebook employees (including myself). The article (particularly the original version) has a lot of inaccuracies, and is now around 18 months out of date.

http://www.quora.com/Facebook-Engineering/How-accurate-is-th...

michaelmartin · on July 5, 2012

I really like seeing this approach listed:

"resourcing for projects is purely voluntary. A PM lobbies group of engineers, tries to get them excited about their ideas. Engineers decide which ones sound interesting to work on."

That sounds exactly the same as how Github's engineers work.

It's an awesome concept; no-one can justifiably be bored with their projects if they chose them. And if you can't get anyone interested in working on the project, then it's a good indicator it may not be a project worth completing for the company anyway.

I'm sure there are times when someone has to say "We need someone to do this", but I'd be curious to hear from someone who works in one of these environments how common an issue that really is.

jacques_chester · on July 5, 2012

> It's an awesome concept; no-one can justifiably be bored with their projects if they chose them.

Eventually people lose interest. It's only human. But The System has been written, it is in production, and it has accrued a healthy stock of user data.

One day, The System breaks. "Only" tens of millions of users -- less than a percent of all Facebook users -- rely on The System. But they rely on it utterly.

Who will fix The System?

Of the five engineers who wrote it:

* Jack and Mandeep left to launch myfornaxisnatrr.com

* Wei Li has moved to a different group

* John doesn't want to touch it, he only came on board to help Jack

* Michael was an intern but has since taken a job with Google.

Uh oh! There's nobody around to voluntarily fix an existing system. That's boring, and there are no incentives for fixing bugs in obscure features because only launching successful new features gains visibility from higher-ups.

Guess we'll need some mean old managers to round up a posse.

If they care enough.

Meanwhile, The System has acquired millions of users, cost millions of dollars to develop and operate and will now abruptly cause tens of millions of customers to become incredibly frustrated. And at no point has anybody stopped and asked:

"Was this the Right System to build?"

angstrom · on July 5, 2012

> That's boring, and there are no incentives for fixing bugs in obscure features because only launching successful new features gains visibility from higher-ups.

Companies really should find a way to incentivize this. Rewrites are about as silly. "Write the next generation xyz." Translates to "No one here has any idea how xyz was written, so it's being rewritten instead of modified. We look forward to rediscovering the same problems we did the first time."

jacques_chester · on July 5, 2012

> Companies really should find a way to incentivize this.

It can't be done from the top. You need to start with engineers who care about quality over the long term. It has to come from within each engineer.

When Zuckerberg talked about younger programmers being "better", he probably meant "more like me". But old farts are just young farts with expensively-acquired scar tissue.

Most of our most treasured software development lore comes from the lesson that "move fast, break things" just devolves into "fuck, everything is broken, fix it fast".

josteink · on July 6, 2012

> "Write the next generation xyz." Translates to "No one here has any idea how xyz was written, so it's being rewritten instead of modified."

Sometimes, when software has been around for 10+ years, stuff does get too old. It happens. For real.

I know people seem to forget that here, with startups that have been around for less than a year being "old players" and all that, but come on.

Software may have been written with a bunch of assumptions which were valid when the project started but no longer are. Maybe it was written on top of a platform which is no longer as productive as it was when the project was initiated. Nothing kills enthusiasm and productivity like working on a platform no longer deemed modern.

And then you have technical limitations and pre-conditions. Once, servers were expensive and you wanted your solution to distribute its load to clients so that you could save on servers. Once, servers were the only machines powerful enough to complete time-critical tasks within reasonable time-frames, so you architected the software accordingly.

Now, we suddenly have HTML5 and webworkers and shit, so you probably want your solution to be a web-solution with a distributed computation model. I.e. full-circle. Full rewrite. Twice. What's next, I don't know.

Sometimes a rewrite is just right. Saying that every rewrite is rooted in incompetence is simplistic.

mkjones · on July 6, 2012

I actually haven't seen this much as an issue at Facebook (I've been an engineer there for a few years). There are some (usually very smart) people who care a lot about systems working reliably, and there almost always seem to be a few who are willing to jump on issues like this.

One of the advantages of the "bootcamp" approach outlined in the OP is that people feel empowered to jump from product to product, so if something you care about breaks for a small set of users and you wish to fix it, you can do that. As a few others have pointed out, it doesn't hurt to mention this at review time, and people are often publicly recognized and thanked for these "thankless" efforts.

jrockway · on July 5, 2012

The world doesn't work like this. In addition to doing-whatever-they-want-24-7, people are also trying to get promoted. That means there is always an incentive to do some task that nobody else will do. This kind of causes talent to even out between high-profile new projects and not-as-fun legacy projects.

jacques_chester · on July 5, 2012

It doesn't work like this in places where shit being broken is something management cares about.

In places where the goal is to make engineers happy and to impress other engineers, there are incentives for behaviour that are perverse from the customer's point of view.

michaelmartin · on July 5, 2012

I get what you're saying about the likelihood of developers having moved on to different projects/companies, but I don't think that issue is specific to this approach in any way.

Good software should be written on the assumption that someone else is going to be editing your code at a later date anyway. If it's highly readable, then being unable to grab the original developer is less of an issue.

This is one of the reasons for the peer code reviews during the original development.

I doubt it's really that hard to find developers to work on issues like this. I've always thought the idea that their work impacts millions of people, even though that is a tiny percentage of the total userbase like you said, was one of the big attractions of working at Facebook. If I knew of an issue impacting that many people, I'd be delighted to have solved it.

moondistance · on July 5, 2012

I'm curious - what happens when an engineer wants to propose an idea? Do they need to get a PM on-board to lead the project? Who decides if the idea is good?

The PM/engineer divide feels uncomfortable to me, but I have limited experience with these roles, so it's likely simply for lack of knowledge. I hope someone can clarify this for me.

michaelmartin · on July 5, 2012

From the article at least, it seems like there aren't a whole lot of PMs at Facebook.

If an engineer has an idea, they seem to be more encouraged to just find some people to work with, build it, and then roll it out to a very small section of Facebook's users and see how they react to it. Good ideas might then be taken further, but bad ones (Or I guess, ones people just don't care about) won't be.

(I don't work there though, so I have no way of knowing if that's right or not, sorry!)
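One common way to show a feature to "a very small section" of users is to hash each user ID into a bucket and enable the feature for the lowest few buckets. The sketch below is a generic Python illustration of that idea; it is not taken from the article or from Facebook, and the function name, feature name, and bucket count are assumptions.

```python
import hashlib


def in_rollout(user_id: int, feature: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets."""
    # Hash feature and user together so each feature gets its own slice of users.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, stable for a given user
    return bucket < percent * 100  # e.g. 1.0 percent -> buckets 0..99


users = range(100_000)
exposed = [u for u in users if in_rollout(u, "new_photo_viewer", 1.0)]
print(len(exposed))  # roughly 1% of users; exact count depends on the hash
```

Because the bucket depends only on the user and the feature name, a given user keeps seeing (or not seeing) the feature across visits, and raising the percentage only ever adds users.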

wilfra · on July 5, 2012

From the article: a 1:10 ratio of PMs to engineers and 500 engineers = 50 PMs.

That's not an entirely small amount for Facebook since they don't have a zillion different products and features that need to be owned, like a company like Microsoft or Google.

TezzellEnt · on July 5, 2012

This sounds to me exactly how Valve and other organizations operate (http://blogs.valvesoftware.com/abrash/valve-how-i-got-here-w...).

I think that if you get buy-in from the engineers working on the project it can induce ownership and a willingness to watch it succeed and work towards its success. The issue occurs when there isn't adequate A/B testing (or something similar), which can create a fiasco such as Facebook's switching of email addresses.

Companies such as Google allow 20% time to work on projects that might not make it to production, or get killed after acquisitions. I think that policies such as those work better than code being committed on the fly.

Goladus · on July 5, 2012

The facebook system actually sounds really solid, especially the "boot camp" thing that so many companies fail to have, however it's probably pretty expensive. Choosing between fast, cheap, and good, Facebook is choosing fast and good. For many, cheap and good but slow is more desirable.

One of the risks to consider with a "devops" oriented approach is that you may become more dependent on it than you want.

Often, applications split things into a few different categories depending on how tweakable they need to be. There's code, configuration, app administration, and data. Code shouldn't need to change often. Configuration may need to change when the environment changes. App administration (eg creating new accounts) needs to change often, and data is always changing (or at least growing).

The risk is that developers will design the system so that only developers can administer it. Configuration, the settings that may need to be tweaked by sys admins long after the original developers have left the project, may wind up in the code or sometimes lumped into the database alongside end-user options.

It's not a reason not to take this approach, just something to consider when developing internal processes and culture.
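A minimal sketch of keeping configuration out of the code, so that a sysadmin can change it without a developer: defaults live in the program, a JSON file overrides them, and environment variables override the file. The setting names, the APP_ prefix, and the file name are all hypothetical.

```python
import json
import os
import tempfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    smtp_host: str
    batch_size: int


DEFAULTS = {"smtp_host": "localhost", "batch_size": 100}


def load_settings(path: str, env=os.environ) -> Settings:
    """Defaults, overridden by a JSON file, overridden by APP_* environment variables."""
    values = dict(DEFAULTS)
    if os.path.exists(path):
        with open(path) as fh:
            values.update(json.load(fh))
    if "APP_SMTP_HOST" in env:
        values["smtp_host"] = env["APP_SMTP_HOST"]
    if "APP_BATCH_SIZE" in env:
        values["batch_size"] = int(env["APP_BATCH_SIZE"])
    return Settings(**values)


# A sysadmin edits this file (or sets an environment variable); no code change needed.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "app.json")
    with open(config_path, "w") as fh:
        json.dump({"smtp_host": "mail.internal.example"}, fh)
    settings = load_settings(config_path, env={"APP_BATCH_SIZE": "250"})

print(settings)
```

Nothing here touches the database, which keeps environment settings separate from end-user options, the other mix-up mentioned above.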

jsvaughan · on July 5, 2012

Previously on HN: http://news.ycombinator.com/item?id=2594083

↪ How Facebook _actually_ pushes updates to the site

I came across this originally on the Etsy dev blog, rather than HN, and that particular post had some good other stuff about Flickr and Etsy:

http://codeascraft.etsy.com/2011/06/01/pushing-facebook-flic...

yawgmoth · on July 5, 2012

I like the idea of encouraging a high-performance culture, but I don't think the 'perform or die' atmosphere would be healthy for many engineers. I know, idolizing 'rockstar programmers' is a sort of new hotness and I understand that a company like Facebook wants to have super-talented developers, but developers grow and learn new tricks as they mature, and they might take more than six months to do so.

alttab · on July 5, 2012

True - but those engineers don't work at Facebook. This isn't no-child left behind. This isn't hand-hold time. This is the most expensive and expansive internet application in the world. If they need time to ramp up, they can do it on someone else's product and come to Facebook when they're ready.

wpietri · on July 5, 2012

That sounds impressively macho, but it's an attitude that has long-term organizational costs. HN just had a great article on how Microsoft's internal competition deeply harmed the company. The number one complaint I hear from departing Google engineers is the absurd internal promotions system.

Years ago I did a gig at eBay, and I thought their macho attitude was a giant source of problems. Plenty of good, sane people were driven off (or driven mad) by artificially high-pressure situations. Every email about a promotion mentioned how somebody had worked all night to get something done; they were promoting more for drama than for skill. And a "this isn't hand-hold time" attitude was common among senior technical staff, which meant that people often hid their weaknesses rather than getting the help they needed.

The lesson I learned from that is that software companies that take normal circumstances with the intensity of emergencies are gradually cutting their own throats.

That's a lesson reinforced for me spending a lot of time with a family member in hospitals last year. Even when survival was on the line, the best doctors and nurses proceeded with patience and kindness, working to train staff and improve systems as they went. In an actual emergency they moved like it was an emergency. But only then. If they can be serene and thoughtful while dealing with brain tumors, I don't think there's any reason that people at Facebook have to puff themselves up with self-importance.

alttab · on July 5, 2012

This is a good lesson. I was straight-forward to get a point across, but it is easy to see how "excellence first" could turn into a poor work environment without moderation.

It also seems that an "excellence first" culture provokes fears that people can't learn, or aren't allowed to make mistakes. Tone and delivery are important here - especially from leadership. My take on excellence first is you don't have to compromise environment for it. Create an environment that fosters excellence, where people aren't afraid to make mistakes but are committed to learning from them. Everything doesn't need to be treated like an emergency, but there's a level of bullshit on a technical level that shouldn't be tolerated. Again, each situation needs to be felt out individually and moderation is always key.

Personally I don't advocate the internal competition aspect of excellence either. The pitfalls of that are well documented as you have pointed out. I'd promote based on team success, and THEN individual contributions. If you can't work as a team then the whole team loses, only once the team succeeds would you start singling people out to see where the best influences were coming from.

TLDR - I think there is a middle ground between not running it like a daycare and running it like a Wall Street trading firm or gulag.

dot · on July 5, 2012

the most expensive internet application? what do you mean?

Daniel_Newby · on July 5, 2012

"This is the most expensive and expansive internet application in the world."

Facebook uses chat/pictures to sell ads and cheesy games.

Their entire value proposition comes from surfing a wave of attention. What they need are scruffy surfer dudes who keep an eye on the weather, show up at the likeliest beaches, and never stop riding the waves. Anyone with a modicum of skill who keeps paddling out to try again can make a real contribution.

What they do not need are rockstars, certified supergeniuses, arrogant hotshots, etc. You can't buy or impress your way to the big waves, you have to ride there.

If you try to keep winning by being awesome, you will wipe out. For example, Netscape, AOL, Alta Vista, Geocities, MySpace, and many others.

alttab · on July 6, 2012

I never said what they did was important. I said it is the widest used, most distributed, and most encompassing software on the planet.

What it provides is trivial, how it does it is not.

DigitalSea · on July 5, 2012

Don't get me wrong: a developer-led company sounds like an awesome idea in theory, because the developers know the product better than any product manager ever could, but that isn't a good thing. I've worked with companies that have ultra-strict testing of features, where many eyes see and use the code before it is pushed out, and while that process works it can get in the way of progress when the politics come out to play.

The whole "move fast, make mistakes" mantra is a bad approach when you're a high-traffic site like Facebook and risk jeopardising your revenue. A second of downtime can be very costly. Knowingly pushing out half-tested or completely untested features might be acceptable if you have a small user base, you're a new Internet startup or your logo says alpha or beta, but for an established brand and company it's stupid.

This is all my own opinion, of course.

alttab · on July 5, 2012

I would argue the opposite. Facebook is in prime position to move quickly and break things. They have enough traction, gravity, and brand recognition they can afford to make mistakes.

If you have a small user base or are just getting started, delivering the best product experience is the #1 thing you can do for success. Is iterating faster and making more mistakes and giving the early adopters (ones that champion your service the most to others) a sub-par experience worth the speed? Sometimes, yes. All the time? No.

Facebook can change your privacy settings, steal your e-mail address, recognize you automatically in photos, watch your browsing activity outside of their walled garden, and sell your data to advertisers. And yet they still have 500 million+ users.

five_star · on July 5, 2012

"very engineering driven culture. ”product managers are essentially useless here.” is a quote from an engineer. engineers can modify specs mid-process, re-order work projects, and inject new feature ideas anytime"

Maybe this is why Facebook seems chaotic to users. They keep changing to whatever design they want, without much consideration of how users would feel about it. Facebook has now become a combination of the features of the other existing social media.

krosaen · on July 5, 2012

""" Engineers responsible for testing, bug fixes, and post-launch maintenance of their own work. there are some unit-testing and integration-testing frameworks available, but only sporadically used. """

Sounds like a lot of code debt accumulating that could bite them hard down the road - it's one thing to write and manually verify bug free code, it's another for a different engineer to make sure he/she doesn't break that code inadvertently a year later when the original author has moved on to another project or company. I'm not talking about 100% test coverage; if the smoke test for a feature breaking is someone noticing while playing with the site, in the long run it strikes me as a much less efficient way to verify and fix regressions than using an automated test suite. Writing good tests is hard, but keeping a product bug free as more and more functionality accumulates without automated test suites is even harder in my experience.

mkjones · on July 6, 2012

This article's about a year and a half old. We have pretty good unit test coverage on a good chunk of our code (especially core stuff), though admittedly not everything.

Some groups put particular emphasis on this (e.g. the messages team is great about testing), and it shows in the reliability of their products. Even better, they end up building frameworks that make it easier for the rest of engineering to write tests, and drive the whole ecosystem forward.

krosaen · on July 6, 2012

Good to hear. Related: a good article by Eric Ries on how in many situations within a startup technical debt can be used effectively.

http://www.startuplessonslearned.com/2009/07/embrace-technic...

peapicker · on July 5, 2012

I stopped reading at "very unique". It is unique, or it isn't. Intensifiers to 'unique' tell me the writing is below par; and I've been correct about this enough over the years that I stopped bothering.

Dybbuk · on July 5, 2012

Well, I'll be joining Facebook in a few weeks. I am a bit of a laid back type and don't know if I fit into their culture of moving fast.

its_so_on · on July 5, 2012

This is the real Facebook secret sauce in convenient flowchart form:

    What's the most evil thing we can think of doing?
     V 
    Candidate <-----<------- Think of the next-most evil thing we can do
     V                                     ^   
    Can we code that? (no) ------->------->^
     V (yes)                               ^
    Is it legal?                           ^
     V (yes/no)                            ^
    Can we get away with it? (abs. not)--->^
     V (yes/maybe/prob. not)               ^
    Keep shipping!                         ^
     V                                     ^
    Did we get in trouble for it? (no)     ^
     V (yes)                        V      ^
    Claim it was a mistake!         V      ^
     V                              V      ^
    Still in trouble? (no)-> Keep feature->^ 
     V (kind of)                    ^
    Being sued for it? (no) --->--->^
     V (yes)                        ^
    Throw money at lawsuit          ^
     V                              ^
    Did we lose? (yes/no) ----->--->^

alttab · on July 5, 2012

Not very constructive to the conversation, but I gotta give it to ya, this is pretty funny.

tfb · on July 5, 2012

Sadly, that flowchart has some truth to it.

I asked this in another (semi-dead) thread and it didn't get a response, so I'll ask again as I couldn't find any information on it. Was any legal action taken against facebook in the recent e-mail hijacking fiasco? I think there should be, since e-mail has pretty much replaced snail mail in terms of business communications, which directly affects the livelihood of many people. Companies like facebook with so much power at their fingertips should not be able to get away with doing things like altering millions of people's address books for their own selfish reasons.

gouranga · on July 5, 2012

That's great - thanks for posting it :)

gouranga · on July 5, 2012

If you're going to down vote, please at least say why...

bajsejohannes · on July 5, 2012

Comments on HN tend to be voted down if they are not substantial. To quote http://ycombinator.com/newswelcome.html: Does your comment teach us anything?

That page also says that a simple thanks may be acceptable, but the consensus seems to be that those comments are superfluous as well.

sodelate · on July 5, 2012

what does this mean?