How we hash our Javascript for better caching and less breakage on updates (greenfelt.net)
52 points by __david__ on Sept 2, 2009 | 26 comments



The hashing approach is fine, but I don't get why they do this at runtime instead of at site push time. There are so many other things you want to do before making the content live, like running regression tests and validators, getting a commit message that explains the change, etc.

It seems like you will always have a script in there somewhere. Why not have it do the hash tag replacement at the same time and then the content is static?
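A minimal sketch of what such a push-time step might look like, in Python; the file layout, regex, and helper names are my own assumptions, not anything from the article:

  import hashlib, os, re

  def hash_name(path):
      # Content-hash a static file and copy it to a versioned name.
      with open(path, "rb") as f:
          digest = hashlib.md5(f.read()).hexdigest()[:8]
      base, ext = os.path.splitext(path)
      hashed = "%s-%s%s" % (base, digest, ext)
      if not os.path.exists(hashed):
          with open(path, "rb") as src, open(hashed, "wb") as dst:
              dst.write(src.read())
      return os.path.basename(hashed)

  def rewrite_html(html_path, js_dir="js"):
      # Point each script reference at its hashed name.
      with open(html_path) as f:
          html = f.read()
      repl = lambda m: 'src="/js/%s"' % hash_name(os.path.join(js_dir, m.group(1)))
      with open(html_path, "w") as f:
          f.write(re.sub(r'src="/js/([\w.-]+\.js)"', repl, html))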

I may just be a curmudgeon. I go to extremes to turn any runtime code into periodically-generated static code. I once started a mini project that was email-based blogging software. To post a comment, each page and existing comment had a mailto: link with an embedded ID string. Replies would hit procmail, which would pull out the ID and embed the comment in the HTML of the original page at the right place. No JavaScript or CGI, only procmail and static HTML. I wonder if Posterous does some of this?


> I don't get why they do this at runtime instead of at site push time.

Convenience. I'm sure the extra stat(2) has a measurable cost, but your JS files are almost guaranteed to be in the page cache, and Linux should be able to stat a million files a second in that case:

  perl -MBenchmark -e 'timethese(3000000, { cached_stat => sub { stat("/etc/passwd") } })'


Yes. You're doing it right.

Run-time works fine if you're loading all your JS and CSS from the same machine as your markup. But that's not an approach that scales quite as well.


I would assume the files would be in both places, but you just happen to point at the CDN for speed, so at least the frontend could still use the same scheme.


> The hashing approach is fine, but I don't get why they do this at runtime instead of at site push time.

Mostly for development turnaround. It's annoying to "make" between every save and reload. It also makes site updating convenient: we "darcs push" to our test site, make sure nothing is broken, and then we "darcs push" to the main site and that's it.

It works fine for the site at its current scale (stats are fast!) and there is a clear path to making it create the files statically should it become a bottleneck.


I have seen some success in controlling caching of JS files by appending a query string (including something such as a timestamp of the JS file) to the end of the URL:

  <script src="/whatever.js?x=09_02_2009_08_15_AM"></script>

Various versions of IE and Mozilla Firefox seem to treat this as a unique URL because of the query string, even if whatever.js is in the cache. The caching proxies that I've dealt with seem to handle this, but I'm not sure that all of them would ... so the author's approach is much more reliable.
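A minimal sketch of generating such a tag from the file's mtime, in Python; the function name and paths are made up for illustration:

  import os

  def script_tag(path, url):
      # Append the file's mtime as a cache-busting query string.
      stamp = int(os.path.getmtime(path))
      return '<script src="%s?x=%d"></script>' % (url, stamp)

  # script_tag("htdocs/whatever.js", "/whatever.js")
  #   -> '<script src="/whatever.js?x=1251878100"></script>'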


Rails does this automatically with JS and CSS files included by the built-in helpers.

It can also combine a number of JS files into a single file for better performance.


I've had similar success by appending the site version to the query string: src="/foo.js?v=2.1". This has worked for me across all major browsers and required very little work to set up and maintain.


To be pedantic, this requires every API change to touch the version number, which then becomes a major source of version control conflicts (in my experience).


Unfortunately, this doesn't work with CloudFront, so you have to put the unique string in the file/key name.


Yup, I do the same for CSS files.


That's exactly what I did about 6 months ago: "/path/file.js?version=timestamp", where timestamp is:

A) the last modified time of the file in milliseconds, or B) the current time in milliseconds if in debug mode.

Depending on whether debug mode is turned on, either the minified file or the original is served to the client. I minify on demand; since it's so fast and files rarely change, only the first request takes the hit, and it is not a big hit, maybe 1 or 2 seconds.

I serve these files with a 1-year expiration time.
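A minimal sketch of that versioning rule, in Python; the DEBUG flag and function name are assumptions:

  import os, time

  DEBUG = False  # assumption: some site-wide debug flag

  def versioned_url(path, url):
      # File mtime in production; current time in debug mode, so the
      # browser re-fetches on every reload while developing.
      if DEBUG:
          stamp = int(time.time() * 1000)
      else:
          stamp = int(os.path.getmtime(path) * 1000)
      return "%s?version=%d" % (url, stamp)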

I do the same for CSS.

It works very nicely :) I NEVER have to clear a browser's cache when developing, and it guarantees that all users always have the latest code.


But it doesn't solve the problem where the browser caches the HTML but not one of the JS files. In that case the user would load a .js file (?version=old) and get the new .js contents, because both URLs point to the same file on disk (URL parameters are not separate files). By physically naming the file after the hash, you are guaranteed to get either the old set of files or the new set of files.


I'm not exactly sure what you mean.

file.js?11111 and file.js?22222 are considered separate URLs and would get redownloaded by the browser.

The URL parameters are significant in browser caching.


Yes, but file.js is on the disk and is always the new version.

If the HTML is cached then it will still have the old reference to file.js?11111. If the browser goes and fetches that from your site, it will not get the old version, it will get the new one. Now you have a mismatch and you can get errors.

With the hashing scheme this does not happen. The HTML references file-oldhash.js, which still exists on the server in its original form. Therefore, your pages are atomic: either you get all new scripts or all old scripts. That's exactly what you want; it's the mismatch between new and old that causes problems.
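To make the atomicity concrete, a minimal sketch in Python; the names are hypothetical, not the site's actual code:

  import hashlib, shutil

  def publish(path):
      # Content-addressed copy: old hashed versions stay on disk, so
      # cached HTML keeps resolving to the files it was built against.
      with open(path, "rb") as f:
          digest = hashlib.md5(f.read()).hexdigest()[:8]
      base, ext = path.rsplit(".", 1)
      hashed = "%s-%s.%s" % (base, digest, ext)
      shutil.copyfile(path, hashed)  # never deletes previous versions
      return hashed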


What I don't like is that their solution still depends on the file dates of the script and all its dependencies. While usually accurate, file dates can be misleading, not to mention the performance hit of doing stat() on all the related files.

I'm using much the same thing here, except that I do the compilation step manually and also combine all the JS files into one big minified one.

The templating engine always uses that combined file if it's available (one stat call). If not, it uses the non-combined versions (useful for development).
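A minimal sketch of that fallback, in Python; the file names are assumptions:

  import os

  def js_sources(files, combined="all.min.js"):
      # Prefer the combined minified bundle when it exists (one stat);
      # fall back to the individual files during development.
      if os.path.exists(combined):
          return [combined]
      return files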


> What I don't like is that their solution still depends on the file dates of the script and all its dependencies. While usually accurate, file dates can be misleading

I'm curious what you mean by that. File dates are what "make" and all the other build tools are built upon. If they couldn't be trusted then you could never compile programs correctly...

> not to mention the performance hit of doing stat() on all the related files.

The thing is, stats are really, really fast. Check onedognight's benchmark above (http://news.ycombinator.com/item?id=800778). On my server I get 1.6 million stats per second.


I have also used this method for some time. An advantage the author does not mention is that this method is required if you want to host your files on a CDN like Amazon's CloudFront. Another point is that you should host your CSS and image files using this technique too (as the author also notes), because you don't have to rely on a browser's cache expiration and because you can cache aggressively (like setting an expiration date 10 years from now).


What goes for .js goes for any content type: static HTML, PNG, JPEG, and GIF files, videos, and so on. If there is a caching issue then you will not see updates on the client side.

I can see this method has advantages, but it is a band-aid over a non-functioning caching system in the first place.

Makes you wonder what caused the original problem!

This system has advantages, there is no doubt about that, but what caused the original issue (some browsers not fetching updated JavaScript files) remains unsolved.


Caching is Hard (TM). If you take the time to fully understand the HTTP caching system, it will make sense, and it will become clear that problems with stale content come from using it wrong, usually from setting an Expires time in the future for content that will change. It's really easy not to understand the caching system, or to default to whatever a framework does, and that's where the problem usually lies.

A framework will often choose to default to setting some Expires time, often because the framework authors don't really get caching either and don't fully understand why that's not going to work. (It is natural that they end up here. ETags are a much safer default, but making them efficient requires more work from the framework user; why that is, is a bit more than I'd like to get into in this message.)

If content is going to change, you should be using ETags, not expiration times. This includes content that you may not think is going to change, like JavaScript files, but in fact sometimes does. The best solution in this case is actually to keep creating new URLs and set Expires into the far future, so that if the browser needs "something_8ef38b.js", whose Expires is set in 2030, it knows it doesn't even have to hit the server; and if you "change" the file, your new web pages will actually reference "something_f32190.js", a new file. The timestamp-on-the-end trick is the same idea. I wish more frameworks built this in better; it's generally useful.

(Drawing parallels with functional programming's idea of "value" is left as an exercise for the reader.)
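A minimal sketch of the far-future headers for such content-addressed URLs, in Python; the helper is hypothetical:

  import time
  from wsgiref.handlers import format_date_time

  TEN_YEARS = 10 * 365 * 24 * 3600

  def immutable_headers():
      # The file at a hashed URL never changes, so the browser never
      # needs to revalidate it before 'Expires'.
      return [
          ("Expires", format_date_time(time.time() + TEN_YEARS)),
          ("Cache-Control", "public, max-age=%d" % TEN_YEARS),
      ]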


Funny, your best solution is exactly what the post described, with the addition of creating the filenames and cache on the fly for ease of use.


I didn't pretend it was new. I was wrapping more context around it, so that it was less "follow this magical recipe" and more "here's why you should do this" (and also "here's a bit on why so many people end up doing this wrong thing"), though a full accounting of HTTP caching would be much longer. I also pointed out similar other things in the comments made by other people.


Why bother with hashing? Why not just append 'version numbers' to the JavaScript files (e.g. sht.1.js, sht.2.js, sht.3.js)?

This way you have a 'better' history because the numbers are in order and you can immediately tell which JS file was the previous one. While you're at it, why not just make this a commit/push hook rather than needing to generate it on the first page load? It would still be automatic for you.


You should really read Isaac's comments on the blog:

http://blog.greenfelt.net/2009/09/01/caching-javascript-safe...

They are really insightful and reflect the knowledge gained from a lot of work he's done at Yahoo! on this subject.


Interestingly, this is the same solution used in GWT, for all the same reasons given in this post.


Django-compress does most of this.



