Indeed. It has always amazed me that so many people treat their static content as something that needs to involve a database.
Look at how many blog entries come through here and have the entire site fall over from a few thousand people going to read them. Somewhere there's a poor slicehost instance performing "select * from blog_entry where key like '10_reasons_rails_is_awesome'" thousands of times in a row until something melts.
Worse, the proposed solution when this happens is to add caching.
No, the thing to do would be to add a .html file to your webserver. I defy anybody to find a modern web server that can't serve a static file a thousand times per second from the smallest VPS slice on the market.
It's a solved problem. But people keep unsolving it.
> I defy anybody to find a modern web server that can't serve a static file a thousand times per second from the smallest VPS slice on the market.
Apache 2, with KeepAlive on, using the prefork MPM that will be installed if you have PHP running on your system. Specifically, when you try serving a static file a thousand times per second to a thousand clients, your first 150 clients are going to saturate all available worker processes for 15 seconds, then your next 150 clients are going to saturate all available worker processes for 15 seconds, then ... most of your users will see their browsers time out and it will appear that the server has crashed.
n.b. This is the default configuration you'll get if you do weird crazy things like "sudo apt-get install apache2 php5" on Ubuntu.
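For reference, this is roughly what the stock config of that era gives you (exact values vary a little between releases), and with mod_php every one of those workers is a full PHP-loaded process:

    KeepAlive On
    KeepAliveTimeout 15            # an idle client pins a worker for up to 15 seconds
    <IfModule mpm_prefork_module>
        StartServers          5
        MinSpareServers       5
        MaxSpareServers      10
        MaxClients          150    # hard cap on simultaneous connections
        MaxRequestsPerChild   0
    </IfModule>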
Edit: Added victim site below if you want to test this.
[Edit the second: I fail at reading comprehension of my config file at 4:30 AM, and underestimated the number of simultaneous connections it could take at once. Still, I'm pretty certain this is technical reality after the numbers are corrected.]
If anyone actually tried to run Apache/mpm_prefork with "MaxClients 150" on an average Linode or cheap dedicated server, they'd OOM and start thrashing as soon as someone started requesting a couple dozen PHP pages -- even if the rest of the requests were to static resources. That's just another way to get yourself DoS'd. Received wisdom on the Linode forums is that you should handle no more than 10-15 simultaneous connections if you're stuck with Apache/mpm_prefork and PHP. So your original point stands.
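Illustratively, the tuning that advice boils down to is something like this (the numbers are made up for a small VPS, not gospel):

    <IfModule mpm_prefork_module>
        StartServers          2
        MinSpareServers       2
        MaxSpareServers       5
        MaxClients           12    # a dozen mod_php workers at ~50MB each fits in a small slice
        MaxRequestsPerChild 500    # recycle children so leaks don't accumulate
    </IfModule>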
I was under the impression that the 15-second timeout for PHP meant that if a page took longer than that to load, Apache would kill it. I thought the worker processes would be available to serve content immediately after they're done serving a page.
I disagree that high-performance blogs are a solved problem in distros - depends on who's linking to you.
WordPress on Apache melts on a small VPS under a few hundred hits per second, using gobs of memory for each request. So you turn on WP Super Cache etc. and it gets a little better, at the cost of a lot of application complexity.
Now put varnish in front, override some of the cache-ability headers from your application, and my experience is that when e.g. Stephen Fry's twitter links to the site, your site becomes CPU or network-bound instead.
From memory, from maintaining a friend's site, the number of simultaneous connections needed to melt down the server (using siege, on a 4GB system) was something like 300 without any optimisation, double that if WordPress had spat out .html files and Apache was serving them, but with Varnish in front it only started to slow down at around 2,000 connections.
There's a reason you want to serve your application from a database - it's nice and easy to change your pages on the fly, but serving static files through Apache is hardly the best you can do for optimisation.
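To give a flavour of the Varnish bit: the "override the cache-ability headers" part is a few lines of VCL, roughly like this (a sketch in Varnish 3 syntax assuming a WordPress-ish backend on port 8080, not the actual config from that site):

    backend default {
        .host = "127.0.0.1";
        .port = "8080";
    }

    sub vcl_recv {
        # treat anonymous blog traffic as cacheable: drop cookies on plain GETs
        if (req.request == "GET" && req.url !~ "^/wp-(admin|login)") {
            unset req.http.Cookie;
        }
    }

    sub vcl_fetch {
        # ignore the application's no-cache headers and keep pages for 5 minutes
        if (req.url !~ "^/wp-(admin|login)") {
            unset beresp.http.Set-Cookie;
            set beresp.ttl = 5m;
        }
    }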
Although I take your point that people do a lot of needless computation these days, I don't quite understand the dig at caching. Isn't adding an HTML file to your server just manual caching?
Pretty much. Only chances are it's not actually manual, either.
The way I do this for the CMS/Blog stuff on my products is to set up a 404 handler that looks for URLs that seem like they should be blog entries. So if it sees a request for the non-existent http://mysite.com/blog/caching_is_awesome.html, it'll do a quick check for that article in the database, and if necessary create the static .html file before redirecting to it.
It's a little nicer than caching because you only ever need to look up/create the thing once each time it changes. From there, it's just the webserver being a webserver. No need to involve any application layer at all.
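In rough Rails-flavoured terms (an illustrative sketch with made-up names, not my actual code), the handler amounts to:

    require "fileutils"

    # routes.rb sends unmatched /blog/:slug.html requests here instead of a plain 404
    class MissingPagesController < ApplicationController
      def blog_entry
        post = Post.find_by_slug(params[:slug])   # assumes a slug column on posts
        return render(:file => Rails.root.join("public", "404.html"), :status => 404) unless post

        html = render_to_string(:template => "posts/show", :locals => { :post => post })
        path = Rails.root.join("public", "blog", "#{params[:slug]}.html")
        FileUtils.mkdir_p(File.dirname(path))
        File.open(path, "w") { |f| f.write(html) }   # from now on the webserver serves the file itself

        redirect_to "/blog/#{params[:slug]}.html"
      end
    end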
> The way I do this for the CMS/Blog stuff on my products is to set up a 404 handler that looks for URLs that seem like they should be blog entries. So if it sees a request for the non-existent http://mysite.com/blog/caching_is_awesome.html, it'll do a quick check for that article in the database, and if necessary create the static .html file before redirecting to it.
You have done the webserver equivalent of method_missing trickery in Ruby.
Even a minimal blog written in Rails can cache to .html files with a page caching line in the PostsController and some lines for expiration when new posts are created/updated/destroyed.
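Roughly (Rails 2/3-era API; the expiration lines often live in a cache sweeper instead of the actions themselves):

    class PostsController < ApplicationController
      caches_page :show   # first hit writes public/posts/<id>.html; the webserver serves it after that

      def create
        @post = Post.create!(params[:post])
        expire_page :controller => "posts", :action => "show", :id => @post
        # the same expire_page call goes in update/destroy
        redirect_to @post
      end
    end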
Serving static files is more about pre-processing, or rather:
Don't do at run-time what can be done at compile-time.
This doesn't need to be any more manual than the labor involved in uploading content into a CMS or data store by another name (e.g., a blog engine). It's more of a substitution of one task for another.
I prefer to think of the static HTML approach as the difference between using Makefiles versus a shell script for compiling code. Once you realize that the shell script will just end up duplicating nuances that 'make' already offers, using Makefiles for compiling code usually gets viewed as the better approach. (There are always edge cases, of course.)
But what I'm saying is, doesn't caching accomplish essentially the same thing? You process the resource once, cache the result and serve that. Full static site generation just seems like a more aggressive version of the same thing, and AFAIK you have to give up some of the benefits of being partially dynamic (like the ability to have a "Latest comments" sidebar without unnecessarily relying on JavaScript).
I think there's a semantic hangup here: what you're describing is accurate, but it's probably "better" explained (in the case of something like the popular WordPress caching plugins) as "compiling" your content to flat HTML which is saved on disk. The web server/CDN can then "cache" that file.
Yes, end results are equivalent, but complexity shifts.
If a consistent stack is more important, caching may be the best option. If fewer layers of complexity at run-time is more important, out-of-band/pre-processing may be the best option.
> It has always amazed me that so many people treat their static content as something that needs to involve a database.
It's not static. Maybe the text is, but the entire HTML document is plenty dynamic. You always have your list of recent blog posts, often some dynamically calculated dates ("posted yesterday", "25 hours ago"), some quote-of-the-day or other banner, and maybe advertisements. Personal blogs might be able to get away with plain HTML, but that doesn't fly at a corporate level (even company blogs like MSDN's), where tons of hands are in the pot, all with their own widgets and contributions that must automatically go on every page of a site.
What you really want is a CMS with a compilation step that outputs static HTML each time something changes. That gives you the flexibility of database-driven CMS with the runtime performance of static HTML.
I'd argue that you can get away with it in many cases.
* Disqus for comments.
* You re-generate the static site on every deploy. The "recent posts" list can be regenerated with every new post.
* It's easy to convert a date into "2 days ago" with javascript.
* Quote of the day, or other banner can be javascript too.
* Advertisements are almost always javascript.
I'm not seeing the need for a database in anything you described.
I've had this same problem myself. You can indeed get around some of it with JavaScript, but I'd rather not use JS for something unless I really have to: you often end up with a slower site for the end user, because you end up serving extra JS files (sometimes including jQuery), not to mention noscript users etc.
I think the problem is that stuff that starts out pretty static often ends up slowly getting more dynamic, and at some point you have to re-engineer it, which becomes a pain. So it's easier to plan around this from the start.
I took over a project which used static content to serve everything, which involved a lot of fwrite() PHP calls. As this got more complicated, we ended up with cache-invalidation-type issues, where there was a hierarchy of content that needed to be re-written back to various files every time something was changed on the site.
This meant that saving changes to the site became incredibly slow, since we often erred on the safe side and ended up re-writing some things multiple times; the code that checked files to see which parts needed regenerating also became exponentially more complex.
In the end I just generated everything from the DB and used memcache in a few select places. Performance for serving content was about the same as serving the static files, the code was much cleaner (which helped improve performance in the long run), and usability from the content administrators' POV was much improved.
You should always aim to be serving your most common content directly from RAM anyway, so whether that's from the kernel's page cache or memcache doesn't matter so much; you can probably solve this problem with clever proxying too.
> What you really want is a CMS with a compilation step that outputs static HTML each time something changes. That gives you the flexibility of database-driven CMS with the runtime performance of static HTML.
> Worse, the proposed solution when this happens is to add caching.
>
> No, the thing to do would be to add a .html file to your webserver.
Isn't adding an HTML file the same as caching, only with different tools? I see no difference between writing a blog article in an HTML editor -> saving it as an .html file -> uploading it to the server, and writing a blog article in blog software -> saving it to the database -> publishing it as an .html file.
Any time I set up a new website, I use WordPress with a theme from WooThemes. It's just the easiest thing to set up and make beautiful. A thousand times easier than using static HTML to make sites.
And ease of initial deployment is the only thing that matters to most of the population. The fact that this configuration may theoretically break one day? Who cares, as long as it takes me 3 hours to get it up and it's fixable.
Thanks to modern JS & HTML, static sites can also be a lot more dynamic than the boring old static sites of the past. You can integrate things like DISQUS without any dynamic content at all.
Yeah, they're gearing up for a new product launch. Is this supposed to be bad?
Besides, that only tells us that they post a lot of new content on their blog. It says nothing about how this content gets voted up on HN -- which, I presume, is simply folks finding it relevant and upvoting it. Or do you think they have their employees doing the voting on their own articles? Because certain people certainly imply that. To which I respond:
1) HN is a social news site for hackers; people vote on stuff they like, and DHH and co. are popular among Ruby/Rails hackers, to say the least.
2) HN is a social news site for hackers. Do you really think a company like 37signals believes it'll gain so much in new customers by having its articles linked and discussed here that it would devise such a scheme?
It is news when the company that created arguably the most well-known web framework posts a blog entry titled "No framework needed". If you're looking for a TL;DR, it would be "Use the right tool for the job. Do not be dogmatic about your technological choices."
I've been looking at doing the same thing myself. I noticed you don't have comments on your blog. I guess the solution would be a 3rd party plug-in like disqus?
"Now, a number of tools are challenging that assumption. Movable Type, the program that runs this weblog, has a series of Perl scripts which are used to build your webpage, but the end result is a bunch of static pages which are served to the public."
These days, you can do so much more with off-the-shelf client-side components too, e.g. comments via Disqus et al., tracking via Analytics et al. It's really the best approach for most sites.
I find it interesting that the static website ever went out of favour for this kind of site. The simplest approach is usually the best.
I have one website that is compiled using make and sed. Tools most programmers already have on hand. It took virtually no time to set up and it gets the job done quite gracefully, while still leaving room to be integrated into a more complex system should the site's future needs dictate it.
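As a toy illustration (not the actual Makefile; it assumes pages/*.txt for content and a template.html containing a @CONTENT@ placeholder, and recipe lines need a literal tab):

    PAGES = $(patsubst pages/%.txt,public/%.html,$(wildcard pages/*.txt))

    all: $(PAGES)

    public/%.html: pages/%.txt template.html
    	mkdir -p public
    	sed -e '/@CONTENT@/r $<' -e '/@CONTENT@/d' template.html > $@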
I think the evolution was driven by the desire to have non-technical people be able to maintain websites. The easiest way to do this is a web app, and once you have a web app, making the entire site dynamic (a) is easy and (b) opens the door to tons of shiny possibilities.
Obviously, there were significant efforts in the maintainable static site space, such as FrontPage and Dreamweaver, but I think those ultimately lost out to the flashiness of dynamic websites.
We don't see the exact numbers on the technical side, and my hunch is that pretty much all of the 500ms was saved through client-side optimization.
He mentions spriting some images; that alone saves several HTTP requests, often translating to a 200-3000ms quicker total load time. (Note that we're talking about full page load times, from the first DNS query to DOM ready.)
I agree Ruby is terribly slow at initializing, but it is not going to be the bottleneck when you run your own dedicated boxes (and 37signals does).
Good job on increasing the conversion rate, by the way.
Edit: Also, it doesn't matter if you precache or cache on-the-fly. You should be serving the homepage from memory (and any static pages for that matter).
I love this approach. I'm working on rewriting one of my webapps and I'm taking this concept to the extreme.
The website part will have no backend. Just static files served by Nginx. Even much of the data can be served from .json files. Where truly dynamic content is needed (search, for example), a minimal API sitting on another server, on another domain, will serve it.
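Something like this on the nginx side (host names and paths invented):

    server {
        listen 80;
        server_name www.example.com;
        root /var/www/site;               # static .html and .json files live here

        location / {
            try_files $uri $uri/ =404;    # nothing but the filesystem is consulted
        }

        # the truly dynamic bits (e.g. search) live on api.example.com and are
        # called from the client, so this box never runs application code
    }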
I switched both my blog [1] and our engineering blog [2] over to Jekyll and couldn't be happier. It's incredibly easy to deploy (we push our blog straight to S3) and working locally is a breeze.
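The deploy step can be as small as something like this (bucket name made up; older Jekyll versions built with a bare jekyll command rather than jekyll build):

    jekyll build                           # writes the generated site to _site/
    s3cmd sync _site/ s3://example-blog/   # push it to the bucket that serves the site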
Wouldn't caching do a similar thing? I guess you'd have the complexity of removing it from the devserver (so you don't have to wait for the cache to invalidate to see changes), and cache invalidation is always ugly.
Wait, I didn't read the bit about JS and CSS aggregation. OK ... but surely there are other ways.
I think this is "caching", with extremely high latency. And, truthfully, for most marketing materials the latency isn't an issue, since approval times are probably much higher.
Really, the only reason you'd choose a dynamic language for marketing materials is to ensure link and image relational integrity. (People are bad at this; computers are good.)
On the last big site I worked on, we used nginx as a reverse proxy to Apache. nginx respects the cache headers you send out, so nearly every request is served straight from its cache pretty much instantly, and you can have a highly dynamic page without having to worry excessively about performance, as long as the response time when you do happen to hit an invalidated cache isn't unreasonable. Seems like a good approach to me if the site requires a dynamic back-end and also requires very high performance, as long as there's nothing on the page which needs to be personalised on the server side (i.e. can't be personalised via AJAX).
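Roughly, the nginx side of that setup looks like this (paths and names invented, not the actual config):

    proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=pages:10m max_size=1g;

    server {
        listen 80;

        location / {
            proxy_pass http://127.0.0.1:8080;   # Apache sitting behind nginx
            proxy_cache pages;
            # nginx honours the upstream Expires/Cache-Control headers by default,
            # so responses stay in the cache until the application says otherwise
        }
    }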
I've not read the article, so I'm not sure if this has anything to do with their reasons for using plain HTML (which, of course, has its place!).
I just used middleman http://middlemanapp.com/ for this exact purpose. A lot of sites don't really need a database, and they don't need to be incredibly dynamic. Most Who, What, When, Where, Why questions can be answered statically.
This will not, in fact, produce reliable results. Your conversion rate will tend to change over time anyhow regardless of the "treatment" option in the second "statistically significant amount of time", because your conversion rate is sensitive to things like e.g. traffic mix, PR, and whatnot which are not uncorrelated with when you take the measurements.
This is why we don't do medical trials by giving people aspirin, measuring symptoms, then giving the same people a sugar pill, then measuring symptoms again. Instead, we give different people the two treatments at the same time, such that one population functions as a control group for the other. This is the essence of A/B testing, too.
The right way to measure this if you want reliable results is to put a load balancer or something in front of the page at issue and split half the people into the old architecture and half the people into the new architecture, then measure their conversion rates simultaneously. 37Signals knows this, and they allude to it in their blog post. That's OK though. You don't need to apologize for not gathering good data on whether making your site faster is better. Testing costs money, and testing known-to-be-virtually-universally-superior things is rarely a good allocation of resources.
You just probably shouldn't attribute your increase in conversion rates to the change you made without testing.
That would certainly be better experimental design, since you would be controlling for other factors. On the other hand, precisely measuring the improvement in conversion isn't particularly important in this case; it's already clear that faster is better, so you're not gaining much actionable information from the measurement, whereas you would be giving half of your users a worse experience. In a situation where you were uncertain about which of two methods is better, it would definitely be better to run them in parallel like you've suggested so that you had a fair comparison.
You're right: 37signals don't have to do this test properly. Their prerogative. However, until they do, the 5% figure and implied causation are meaningless.
We don't know if their SD is in the same ballpark. Thousands of possible confounds. Plus, there's no solid a priori reason why shaving off latency should improve their conversion rates drastically -- Basecamp doesn't rest on a large number of small, potentially impulsive transactions like Amazon does. Without more data (or at least an explanation), this doesn't tell us anything.
I agree, and I meant to emphasize more clearly that I don't think the 5% figure is meaningful (and that it shouldn't have been stated without the appropriate caveats in the article).
My main point was simply that I don't think it's prudent to create a worse experience for half of your users when you're so unlikely to gain any actionable information from it. It would be quite extraordinary if they found out that the speed increase caused a decline in conversion or a huge improvement in conversion, so I think it's safe to say that the measurement wouldn't make them reverse the change or make them allocate significantly more resources toward faster load times.
But, to reiterate, I agree that they should not have made the claim about 5% conversion when it wasn't properly supported.
Yeah, serving up two versions simultaneously and split testing them would be more scientific, but I appreciate the number anyway. It was an after the fact observation rather than an original goal, so I wouldn't expect him to go back and deploy the slower version just to test the number.
The hole here is whether or not they unknowingly got a new influx of traffic that was 5% more likely to convert, skewing his final observation, which I would say is unlikely. Your point is good in general, however.
Nitpick: You'd have to test the difference between before and after at a certain level of significance, not just establish estimates for before and after. Different things.
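For the back-of-the-envelope version, the usual tool is a two-proportion z-test; with made-up numbers:

    z = (p1 - p2) / sqrt( p * (1 - p) * (1/n1 + 1/n2) ),  where p is the pooled rate

    e.g. 500 sign-ups from 10,000 visits before vs 525 from 10,000 after
    (a "5% improvement") gives z of about 0.8, well short of the ~1.96
    needed for significance at the 5% level.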