I wish my blog would get picked up when it is NOT making me look like an idiot. sigh
Edited to add: So I was trying for a less pithy comment when the phone rang from overseas. Sorry about that.
Anyhow. Yep, this story really did happen to me. The good news: a one line tweak to a little file I didn't touch in something like six months probably added about 50% to my sales. The bad news: in those six months, this probably had actual economic costs of several thousand dollars. Or, to put this in relative terms, the worldwide economy just freaking collapsed and I still managed to do more damage to my net worth with a freaking typo than the total massacre in my IRA.
I don't think this makes you look like an idiot. I submitted this article from your blog because it is insightful and shows how apparently small, unimportant "display errors" can have very real consequences in terms of traffic and revenue.
Getting cron right can be trickier than it looks. Even if you test your script with the right uid, are you sure cron is running with the same environment variables as the shell? I've been bitten several times by cron skipping my shell profile.
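For what it's worth, a quick way to see the difference is a tiny script that dumps its environment, run once from your login shell and once from cron, then diffed. A minimal sketch (the output path and filename are just examples):

    #!/usr/bin/env ruby
    # env_dump.rb -- run once from a shell and once from cron, then
    # diff the two files. PATH, HOME, and SHELL frequently differ,
    # and cron may not even find "ruby" unless you give the full
    # path to it in the crontab entry.
    out = "/tmp/env_dump.#{Process.pid}.log"
    File.open(out, "w") do |f|
      ENV.keys.sort.each { |k| f.puts "#{k}=#{ENV[k]}" }
    end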
I just wish my cron scripts had the same earning potential :)
I know the feeling. That's actually a major reason why I keep the blog up -- it is probably my best single record of what the heck I was thinking when I implemented something. I find explaining things to other people leaves better hints than my cryptic "Of course I'll know what I meant in a year! After all, I wrote this comment!" documentation.
Embarrassing story from someone I know: if you write your own billing application, make sure you round numbers correctly, or you could cost yourself hundreds of dollars a month in under-billing clients.
I'd say that's a nice problem to have... if you're under-rounding by one cent enough times to cost yourself hundreds of dollars, that would mean you're making at least tens of thousands of dollars in sales...
In the particular manifestation of this bug, he was under-charging by much more than one cent. It was thousands of dollars a month, not tens of thousands.
I noticed a couple of utility providers making the "rounding errors" in their favor. I sensed evil intent on their part. It never occurred to me that it might be a genuine software bug.
I write billing software for a utility provider, so... yep, it's a bug. Usually that sort of thing manifests when data moves between systems with slightly different rounding rules. There are a ton of integration points to third-party systems, and they don't always agree about how these things work. Fractional things like taxes, prorating, and earned vs. deferred income make precision errors very easy. We've got sanity checks in the code to make sure nothing crazy is going on when we get results from third parties, but so long as any errors are within certain margins, the accountants using the software are happy.
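To make the "slightly different rounding rules" point concrete, here's a toy Ruby sketch (the price and quantity are invented): one system rounds each line item to cents before summing, the other rounds only the invoice total, and they disagree by a cent on every invoice.

    require 'bigdecimal'
    require 'bigdecimal/util'

    price = "0.333".to_d   # hypothetical per-unit price
    qty   = 3

    # System A: round each line item to cents, then sum.
    per_line = price.round(2) * qty      # 0.33 * 3 => 0.99

    # System B: sum exact amounts, then round the invoice total.
    invoice  = (price * qty).round(2)    # 0.999    => 1.00

    puts per_line.to_s("F")   # "0.99"
    puts invoice.to_s("F")    # "1.0"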
Incidentally, accountants are the best users ever for bug reports. Their eye for detail and the multitude of reconciliation processes they use to analyze the results of your processing are fantastic for finding subtle bugs.
Everybody should have an accountant. They've been running somewhat buggy software (tax law, balance-sheet rules, etc.) on unreliable hardware (humans) for a very long time.
Apart from staying alert, being cautious, and stressing testing, the takeaway for me is
"this bug made a live, growing website look largely dead to the world, and Google accordingly sent most of their searchers to get their bingo cards elsewhere."
Most web businesses need to understand how Google search works better than they understand their own business.
I don't see why I should be downmodded for pointing out a valuable takeaway from the post while your comment gets upmodded. Yours is the one that's irrelevant to the post.
PG, please add an official list of formatting don'ts. It's only natural to quote here the way one does elsewhere on the web.
Trying to be more specifically helpful, since I don't think I conveyed exactly what the objection was the first time (or at least not so well that you understood):
It's not that you quoted. It's that the quote is wrapped in <code> and <pre> tags, and that breaks the layout for lots of people. I'm asking you not to do that, because its effect is obnoxious.
By intelligently, you mean by letting the text vanish instead of putting in a horizontal scrollbar? While this is my preferred behavior, I don't think I'd necessarily call it intelligent.
So, being that this is Rails, there are a couple of ways I could have done this (assuming, again, that I cached):
1) Cache the page in memcached, which would give me access to an easy time-based expiry syntax.
2) Page cache. Instead of cleaning the cache with a one-line cron script, write a 10 line cache sweeper. Then, install a plugin to periodically execute the sweeper... or execute it with a cron script.
3) Page cache. Delete with a cronjob.
Why I didn't do #1: Memcached is overkill for my needs, and keeping another process running on the VPS just gobbles its 256MB of RAM even faster, while giving me another potential security/uptime headache.
Why I didn't do #2: Doesn't eliminate the dependency on cron, it just moves the actual delete operation from a single line in the crontab to more complicated Sweeper logic in the app. That is logic I'd then have to actually maintain and whatnot. (And doing #2 wouldn't have helped me if I had executed the Sweeper job as the same user I executed rm -rf as, because it would similarly have been unable to remove the file.)
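For anyone who hasn't touched Rails caching, options 1 and 3 look roughly like this in Rails 2-era syntax. CardsController, Card, featured_on, and :daily_card are made-up names for illustration, not the actual code:

    class CardsController < ApplicationController
      # Option 3: page caching. Rails writes the rendered page to
      # public/cards/daily_card.html and the web server serves that
      # file until something (a sweeper, or a cron job) deletes it --
      # which is where the crontab line comes in.
      caches_page :daily_card

      def daily_card
        # Option 1 instead: object caching with time-based expiry,
        # which works with memcached or any other Rails.cache store
        # and needs no external cleanup at all.
        @card = Rails.cache.fetch("daily_card", :expires_in => 24.hours) do
          # featured_on is a hypothetical column.
          Card.find(:first, :conditions => ["featured_on = ?", Date.today])
        end
      end
    end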
I might be careless but I don't make choices totally at random. ;)
"keeping another process running on the VPS just gobbles its 256MB of RAM even faster"
Why not upgrade? On Slicehost, going from 256 to 512 MB is only $20 extra. If you're running a business on it, I think that's a pretty reasonable expense, especially if it can save you even a little time (within reason).
I never said you made it at random, just that even if it were the best option it was still a poor one. From the options you laid out, it appears none of them were particularly suited to your situation. If it were me, upon realizing that, I would have just waited to see if caching were really necessary before instituting a cron hack.
Removing them with a cron task is usually the worst way to solve this problem. Using the expire_page method is the correct solution when the updates are being done as a result of a user's action (updating/deleting/creating a record). The problem here was that they weren't being done by a user, but automatically by the system.
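For reference, a sweeper is basically an observer that calls expire_page when a record changes; a minimal sketch with made-up names (Card, CardsController):

    class CardSweeper < ActionController::Caching::Sweeper
      observe Card

      # Fires when a user creates or updates a card through the app,
      # which is exactly why it can't help with a cache that goes
      # stale by the clock rather than by anyone's edit.
      def after_save(card)
        expire_page(:controller => 'cards', :action => 'show',
                    :id => card.id)
      end
    end

    # In CardsController:
    #   caches_page :show
    #   cache_sweeper :card_sweeper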
There's nothing wrong with premature optimization. Who knows if your site will ever get hit hard by Slashdot, Digg or the like. If your site isn't up to par, you'll crash and all of those people, and possible ad revenue, will turn away.
The problem here is when you don't test your code properly.
There's everything wrong with premature optimisation. Optimisation isn't free. It costs in terms of time and code elegance and versatility - often once you've optimised a piece of code to be fast, it's more complex, harder to understand, and harder to change.
Premature optimisation is bad because of this.
Taking things one step further, what counts as premature? Well, your production architecture (if you're a start-up on a shoestring) may well be vastly different from your development or staging architecture (you may run your production site on a clustered EngineYard set-up that costs hundreds of dollars a month and your staging site on a VM that costs $20). And even if the architecture is identical, the data volumes and patterns may be different.
And even if you can afford an exact duplicate of your production environment, and keep the data exactly in sync with your production system, usage patterns may be different. And even if you can afford to simulate the usage patterns in a realistic manner, there will be differences in how the traffic gets routed on an IP level which will also cause the bottlenecks to shift.
All this to say, if you're a start-up on a budget, do not optimise until you see a problem in production. You may well be surprised by how it's not the bottlenecks you expected that actually slow you down. My application has no indexes in the database, because so far, database access has never been a bottleneck. On the other hand, ruby parsing of objects, and AMF rendering of them, and complex aggregation operations on the objects - all those have caused problems that I only found once I was able to observe them happening in a production environment.
Simple solution: don't optimise before you see the problem begin to develop in production.
I think we're being a little harsh in this case and really overstating "never optimize prematurely" like it's some kind of religion.
Optimization is built into most web stacks. Database caching, for example. For Rails this is convention over configuration when you're running with the production flag. Plus there are all sorts of other optimizations going on at every level, because the assumption is that even if a lot of applications don't need them, not baking them in would cripple other apps and add undue complexity. This adds tons of complexity, but it's well abstracted. And many of our apps don't need it... so it's premature, definitely, but it's a pretty fast and fun stack to build with.
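For example, the per-environment defaults in a Rails 2-era app look something like this (exact options vary a bit by version):

    # config/environments/production.rb
    config.cache_classes = true
    config.action_controller.perform_caching = true

    # config/environments/development.rb -- caching off, so you
    # rarely exercise these code paths until you deploy
    config.cache_classes = false
    config.action_controller.perform_caching = false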
I think it is completely acceptable to perform certain types of optimization, even prematurely perhaps, provided the cost is low and you think there's a decent chance, or at least have a hunch, that you could get into trouble if you don't. Experience teaches you which assumptions to make, but it's definitely not a perfect science. Sometimes you overdo it and there's a cost, but there's also a cost if you underestimate. It's an effort we make as developers to balance the two.
So in this scenario, caching some pages sounds pretty acceptable to me, or at least borderline. He implemented the caching in a very simple, basic, commonly accepted way. No memcached or anything fancy, just static page dumps. He's running on a 256MB VPS slice, which is pretty slim for a business, so it's not too hard to imagine he'd run into problems if there were even a small spike in traffic. Plus, his traffic is decent. I for one would probably have cached as well, at least the landing pages anyway.
He got in trouble sure, and as a direct result of his caching, but these things happen when you're running everything yourself. I don't think it's due to bad judgment.
This reminds me of another statement that is often taken to the extreme. "Launch quick and get it in front of people. Don't assume you know what your users want." I think it's a pretty good reminder once you start to go too far, but of course we all know you still need to make many assumptions just so you have a product. There's not much value in the canvas alone. These little gems like "Never optimize prematurely" won't do your job for you. The reality is too fuzzy, and that's why you get paid.
I hope this doesn't sound like a harsh reply. I do agree that there are many cases where people go crazy with optimization, features, etc, and it's good to have a little mantra like "Never this, never that". I just don't think he's guilty of any of these in this case is all. The cron was a blunder many of us have experienced. It cost him though, and he's learned that lesson well now I bet. I always learn best from my most costly mistakes.
It looks like all these "my fail story" blog posts are quite elaborate for what they're trying to say.
This story is the same: it reads like a breeze, but all he says is:
"Hey, I didn't delete the cache every day because of a mistake I made in the crontab. That cost me PageRank and visitors."
Doesn't mean that I don't like reading those stories =)
Little tweaks that speed page loading can also result in surprising increases in traffic -- both visitors who stay longer and improved search-referrals. I think fast-loading pages get a ranking boost from Google, too. (They're very explicit that this is the case for AdWords landing pages.)
Although the bug fix was only one line, it apparently necessitated a blog post that was hundreds of lines long. Perhaps the author should ask Gavin Newsom to produce a 7-hour series of videos about the bug fix?
Long story short: I thought "a new card every day!" was an important hook when designing the site. User behavior indicates otherwise -- most don't want novelty, they either want a) the same card everyone else wants (because it is seasonal) or b) a particular card which they Googled for. Only about 8% of my users even see the "featured card of the day" page.
So yeah, on the face of it, this particular feature breaking is unimportant to most of my users. However, Googlebot is MUCH more impressed with novelty than 92% of my users. Given Googlebot's outsized impact on my business, it is sort of an important stakeholder :)