The title suggests that there's something unique about Go, either the language or its standard library, that enables bandwidth savings. In fact, Cloudflare have written some software which they claim enables them to reduce their bandwidth, and this software happens to be written in Go. This might be an excellent choice (and I suspect it probably is), but it's not Go per se that is reducing the bandwidth usage.
I agree. The benefit of using Go is that it's fast to write and has good concurrency features. To give you an idea of the size, there are 7,329 lines of Go code in Railgun (including comments) and a 6,602 line test suite.
In the process we've committed various things back to Go itself, and at some point I'll write a blog post on the whole experience. One thing that made a big difference was writing a memory recycler so that, for commonly created things (in our case []byte buffers), we don't force the garbage collector to keep reaping memory that we then go back and ask for.
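For anyone who hasn't seen the pattern, here is a minimal sketch of such a channel-based []byte recycler. This is not the actual Railgun code; buffer and pool sizes are made up, and the standard library has since added sync.Pool for exactly this job.

    package recycle

    // Sketch of a channel-based []byte recycler; not CloudFlare's code.
    // Buffer capacity and pool depth are arbitrary for the example.
    const (
        bufSize  = 32 * 1024
        poolSize = 128
    )

    var pool = make(chan []byte, poolSize)

    // Get returns a recycled buffer if one is idle, otherwise allocates.
    func Get() []byte {
        select {
        case b := <-pool:
            return b[:0] // keep the backing array, reset the length
        default:
            return make([]byte, 0, bufSize)
        }
    }

    // Put offers a buffer back for reuse; if the pool is full, the buffer
    // is simply dropped and left to the garbage collector.
    func Put(b []byte) {
        if cap(b) < bufSize {
            return
        }
        select {
        case pool <- b:
        default:
        }
    }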
The concurrency through communication is trivial to work with once you get the hang of it, and being able to write completely sequential code means it's easy to grok what your own program is doing.
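A toy sketch of what that sequential-looking CSP style means in practice (names and workloads invented for the example):

    package main

    import "fmt"

    // Each goroutine runs plain sequential code; the channels do the
    // coordination. Purely illustrative.
    func worker(jobs <-chan string, results chan<- int) {
        for j := range jobs {
            results <- len(j) // stand-in for real work
        }
    }

    func main() {
        jobs := make(chan string)
        results := make(chan int)

        go worker(jobs, results)

        go func() {
            for _, page := range []string{"/index.html", "/about", "/contact"} {
                jobs <- page
            }
            close(jobs)
        }()

        for i := 0; i < 3; i++ {
            fmt.Println(<-results)
        }
    }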
We've hit some deficiencies in the standard library (around HTTP handling) but it's been fairly smooth. And, as the article says, we swapped native crypto for OpenSSL for speed.
The Go toolchain is very nice. Tools like go fmt, go tool pprof, go vet, and go build make working with it smooth.
sigh This is by far Go's biggest wart IMO, and one that frequently sends me back to a pauseless (hah! at least less pausy :) systems language. I sure do like it in almost every other meaningful regard. But I wish latency wasn't something the designers punted on.
I occasionally hear this kind of complaint, but I've yet to see any silver-bullet memory management system. AFAICT, the best we've been able to accomplish is to provide an easier path to correctness with decent overall performance. Also, GC latency isn't the only concern. As soon as the magic incantation "high performance" is uttered, all bets are off.
There's been decades of work on real-time garbage collection yet all of those approaches still have tradeoffs. Consider that object recycling is a ubiquitous iOS memory management pattern. This reduces both memory allocation latencies and object recreation overhead. Ever flick-scroll a long list view on an iPhone? Those list elements that fly off the top are virtually immediately recycled back to the bottom -- it's like a carousel with only about as many items as you can see on screen. The view objects are continually reused, just with new backing data. This approach to performance is more holistic than simply pushing responsibility onto the memory allocator.
Memory recycling here also reminds me of frame-based memory allocator techniques written up in the old Graphics Gems books, a technique likewise covered in real-time systems resources. Allocating memory from the operating system can be relatively expensive and inefficient, even using good ol' malloc. A frame-based allocator grabs a baseline number of pages and provides allocation for one or more common memory object sizes (aka "frames"). Pools for a given frame size are kept separate, which prevents memory fragmentation. Allocation performance is much faster than straight malloc, while increasing memory efficiency for small object allocation and eliminating fragmentation. Again, this is a problem-specific approach that considers needs beyond latency.
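For the curious, here is a rough Go rendering of that frame-pool idea: one pool per frame size, pre-carved from a slab. A real allocator of this kind would live in C/C++ below the garbage collector, so this is purely illustrative.

    package frames

    // Rough sketch only: one pool per frame size, pre-carved from a slab.
    // Freed frames go back on the free list, so a pool never fragments.
    type Pool struct {
        frameSize int
        free      [][]byte
    }

    // NewPool carves a slab into n frames of the given size.
    func NewPool(frameSize, n int) *Pool {
        slab := make([]byte, frameSize*n)
        p := &Pool{frameSize: frameSize, free: make([][]byte, 0, n)}
        for i := 0; i < n; i++ {
            // Three-index slice so one frame can never grow into the next.
            p.free = append(p.free, slab[i*frameSize:(i+1)*frameSize:(i+1)*frameSize])
        }
        return p
    }

    // Alloc hands out an idle frame, falling back to a fresh allocation
    // if the pool is exhausted.
    func (p *Pool) Alloc() []byte {
        if n := len(p.free); n > 0 {
            f := p.free[n-1]
            p.free = p.free[:n-1]
            return f
        }
        return make([]byte, p.frameSize)
    }

    // Free returns a frame to the pool for reuse.
    func (p *Pool) Free(f []byte) {
        p.free = append(p.free, f)
    }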
"AFAICT, the best we've been able to accomplish is to provide a easier path to correctness with decent overall performance."
Precisely. Which is why for performance-critical systems code it's important to give the programmer the choice of memory allocation techniques, but to add features to the language to make memory use safer.
Garbage collection is great, but occasionally it falls down and programmers have to resort to manual memory pooling. Then it becomes error-prone (use after free, leaks) without type system help such as regions and RAII.
I can't speak for the grandparent, but for my part I agree with your point that allocation patterns matter and that there is no silver bullet to memory management, which is exactly the reason that GC'd languages like Go are uninteresting as systems languages. Why use a language where you have to work around one of its main features when you care about performance?
I find Rust's approach much more interesting, because GC is entirely optional, but it provides abstractions that make it easier to write clear and correct manual memory management schemes.
Do they have a standard ABI or FFI for interaction with C? If so, they probably designed the assumption of a conservative GC into it. You can always make an incompatible change, but it's a pain.
Go does look awesome. I've spent some time with Erlang, Clojure and Scala (roughly in the order that I liked them most), but Go passed the "get started writing useful code quickly" test better than any of them. Haven't gone beyond the basics yet, but I think it might occupy a sweet spot of ease of use combined with "power", loosely defined.
I live in Lancashire at the moment; the only issue for me would really be spending £150 on a round trip to London. Do you guys do preliminary Skype interviews?
(I'm aware that the post might not be as prestigious as, say, engineering - however, I feel that having someone with strong web development experience (who is a user of Cloudflare already) would more than offset the slight inconvenience on your part.)
We typically conduct the first interviews on the phone or Skype for interesting candidates. If it makes sense to do in-person interviews, we're happy to cover the cost of transportation for candidates we're excited about. In other words, if you're excited about working with CloudFlare, don't let the £150 stand in the way of applying.
We definitely do the initial interview via phone/Skype. There absolutely is room to grow and move into other areas of the company with experience. I highly recommend becoming familiar with the platform through the "front lines" of support. It gives engineers a different perspective on our service.
Ok, that sounds great! (Same to the comment above, replying to this one for continuity) - let me mull it over this week.
And I couldn't agree more: being placed in the firing line of customers is often more telling than building the software yourself - "normal" people tend to notice things which we as developers are prone to miss or gloss over unintentionally.
-----
The awkward moment when I notice I blanked the CEO
Large changes in lifestyle, such as a complete relocation or a new job, shouldn't be undertaken lightly. If I interviewed and got offered the job I would be under a lot of pressure, which I can mitigate now by thinking more carefully before committing to anything. I wouldn't want to waste both my time and the time of the people at Cloudflare by making an important decision without carefully weighing the pros and cons.
How about just emailing them? I feel like I've seen these kinds of 'wow, where do I apply for a job' posts on CloudFlare news articles before. Smells like astroturfing.
The title only suggests something unique about Go to those who didn't read the article.
There's a good chunk of the article dedicated to discussing the language choice and how other languages could have been used but, in this specific instance, weren't chosen. The language choice is as much a part of the topic as the compression routines themselves, so it makes a lot of sense to include 'Go' in the title given that it's a large focus of the article.
It's really no different to all these articles that spring up about fancy demos being built in Javascript or CSS tricks. Yet in those instances nobody says "the title is misleading. You could write that demo in C++ as well."
>The title only suggests something unique about Go to those who didn't read the article.
The thing is, many people use the title to determine whether the article is worth reading. As is, the title suggests that there is something unique about Go that reduces the bandwidth needed by the program, implying that this is something that other common languages fail to achieve. This is obviously impossible (any widely used language is capable of serializing an output byte stream in any way the programmer desires). As a result, the title sets off the alarms for "language fanboyism" and "mathematically impossible claims", and goes swiftly into the "don't bother" pile together with "universal lossless compression algorithm invented!"[1], "perpetual motion machine" and "My favourite X language is faster than C/C++/Assembler!1!1"
> The thing is, many people use the title to determine whether the article is worth reading.
That same argument could be used for having the language in the title as people who are not interested in programming are going to be less interested in a thread about programming.
And language fanboyism is going to happen with or without this title (given the content of the article). What's happening here is more a case of lazy members wanting to commentate on articles they've not even read. It's basically the lowest form of blogging.
I've never used Go professionally and most of my spare time is split between C++ and Scheme at the moment, but when I did go spelunking with Go, I found it a breeze to write complicated functionality in it - it felt like C++, but easier and more immediately powerful.
I still feel that C++ is generally a better choice, but if I only had a short time to write something in, I would definitely go for Go.
For me, I just don't see enough benefits of Go over C++. I already know how to use C++ in a way that avoids or mitigates the problems Go solves. With C++11 support starting to take off, Go's advantages are even smaller.
On the other hand, if I didn't know C++ and I was looking for a native compiled language to learn, I'd probably choose Go over C++.
Depending on what you're doing, the libraries make a giant difference. Look over Go's standard library packages and then imagine what a pain in the ass it would be to find and manage all the separate C/C++ libs it would take to replicate all that functionality (or to write it yourself).
Performance. I also feel more in control; it's hard to describe the feeling, but it's as though Go provides an abstracted interface to the hardware, whereas C++ provides raw, unfiltered, but potentially dangerous access.
That kind of fits Rob Pike's explanation for why Go isn't more popular with C++ programmers (although he gave it a negative spin).
In a neutral way, it's like: if you spent all that effort mastering this language to get such fine-grained control, why would you give that up again? And really, I understand: why would you give that up? Especially if you know how to use C++ in a fairly painless way.
> Performance. I also feel more in control; it's hard to describe the feeling, but it's as though Go provides an abstracted interface to the hardware, whereas C++ provides raw, unfiltered, but potentially dangerous access.
Funny, I thought C++ did exactly the same thing. Where are the L1, L2 and L3 cache references, the multiple opcode execution pipelines, the processor instructions?
Go is used, sure, but the cool part about this is the binary Railgun protocol. Really smart. Send only file hashes and binary diffs back and forth, do a little extra computation to figure out the changes, but only send the absolute minimum data you need to the CDN. That's just smart, and frankly, I hope other CDNs have been doing this already, because at any high volume it seems to be an obvious solution.
So that brings up the question—is this just something CloudFlare is announcing for the PR, or is it actually innovative?
The piece in the CloudFlare network and the piece in the customer network are able to keep track of which page versions they each have and so the part in the CloudFlare network sends a request saying "Please do GET /foo and compress it against version X". That means that at request time there's no back-and-forth between the components deciding what compression dictionary to use.
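The wire format isn't public, so purely as an illustration of the shape of such a request, a hypothetical Go struct (field names invented):

    package railgunish // hypothetical sketch; the real Railgun protocol is not public

    // deltaRequest is an invented illustration of "please do GET /foo and
    // compress it against version X".
    type deltaRequest struct {
        Method      string // e.g. "GET"
        Path        string // e.g. "/foo"
        BaseVersion string // hash naming the cached version both ends already hold
    }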
Well, no good binary delta algorithm uses compression dictionaries anyway (since they are binary deltas, not compression algorithms :P), except to compress the newly added strings, which you can't avoid.
Note of course, that relying on the data not being corrupt on the client (which you must if you assume the compression dictionaries are sane) is dangerous. I assume you guys must store some checksum that you compare once to make sure when someone says "i have version 5, delta against this", that they really have a good copy of version 5?
SVN used to do what you are suggesting, btw. We only send clients deltas against the versions they already have, and precompute them in some cases :)
I assume you guys must store some checksum that you compare once to make sure when someone says "i have version 5, delta against this", that they really have a good copy of version 5?
For what it's worth, this is a fairly standard binary-patching approach, as used in software updates. I am aware of at least two mainstream titles that do this, and I'd be surprised if Firefox, for example, doesn't push updates this way.
(edit) That's an awesome name by the way. Railgun.
Well, fuzzy tries to find something to use as a 'destination' file so it can send across some hashes. Railgun has more complete information because it is keeping synchronized and thus the part making a request can specify the dictionary to compress with in a single hash.
Don't you understand? We need you to accept that this is a new technology and a ground-breaking algorithm and a new innovative (and valuable, non-obvious) technology. CloudFlare was established in 2007 with the goal to develop a faster, safer, better internet. CloudFlare, the web performance and security company, set records this month hitting more than 100 million daily active users and more than 50 billion monthly page views!
rsync is going to perform checksums on blocks to see if the blocks are the same.
It transmits these checksums, and where the checksums differ, it deltas the blocks. Note that insertion/deletion in a file can push block boundaries off between two files, causing a problem known as "stream alignment", which can cause your binary delta to be much larger because it doesn't realize the block really shifted 16384 bytes over (or whatever), and so it thinks the client really doesn't have any of the bytes of that block.
In any case, if you know the files are related, you
1. Don't need to do any of this. You can simply send the binary delta, which is usually copy/add instructions (i.e., copy offset 16384, length 500 to offset 32768) - see the sketch after this comment.
2. Can precompute the deltas.
You can actually precompute in any case, it just makes no sense unless you know you will be diffed against something else.
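A generic sketch of that copy/add encoding (illustrative only; not Railgun's or any particular tool's format):

    package delta

    // op is either a copy from the old version or a run of literal bytes.
    type op struct {
        copyOff, copyLen int    // used when add is nil
        add              []byte // literal bytes to insert
    }

    // Apply rebuilds the new version from the old data plus a patch.
    func Apply(old []byte, patch []op) []byte {
        var out []byte
        for _, o := range patch {
            if o.add != nil {
                out = append(out, o.add...)
            } else {
                out = append(out, old[o.copyOff:o.copyOff+o.copyLen]...)
            }
        }
        return out
    }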
Yes, I simplified and I shouldn't have.
It does detect them, but it does have a minimum block-move size it can detect, due to the signature-matching method.
I think it's more like rsync + git. You have copies of previous versions, and just ask for their hash to figure out which previous version to diff against the current version, then send the diff.
The bandwidth reduction is due to use of a binary protocol, not Go. It just so happens the server code is written in Go and C.
From the article:
“Go is very light,” he said, “and it has fundamental support for concurrent programming. And it’s surprisingly stable for a young language. The experience has been extremely good—there have been no problems with deadlocks or pointer exceptions.” But the code hit a bit of a performance bottleneck under CloudFlare’s heavy loads, particularly because of its cryptographic modules—all of Railgun’s traffic is encrypted from end to end. “We swapped some things out into C just from a performance perspective," Graham-Cumming said.
“We want Go to be as fast as C for these things,” he explained, and in the long term he believes Go’s cryptographic modules will mature and get better. But in the meantime, “we swapped out Go’s native crypto for OpenSSL,” he said, using assembly language versions of the C libraries.
On another note, it's always nice to see such an influential part of the HN community giving quotes for sites like this - not only does it make me a little proud to be associated with any of you, it makes me more hopeful about my future chances of being able to call myself one of you.
The binary protocol means we don't add (much) overhead, the bandwidth reduction is because we are sending page diffs which themselves are encoded in a compact binary format.
Question for jgrahamc: how much more efficient is your binary delta algorithm than cperciva's bsdiff [1]?
I assume since you've got the preimages of compression, as well as control over the compression format, that the diff and patch operations are much more efficient in space and time than they would be with arbitrary binary data. But...by how much?
bsdiff is not a general purpose binary delta algorithm, it's targeted at executables. When you change a single line in the source code of a program and recompile it, bsdiff produces a small diff, even though a normal binary diff between the old and new executable would be huge due to how even a single extra instruction can cause many more addresses to shift. bsdiff wouldn't be particularly useful here.
I am particularly interested in this aspect of the discussion (explaining the process leading to deciding to develop a new tech in-house instead of re-using any existing approach). In an ideal world there would be plenty of experimentation with real-world data to justify things, but I don't read about that happening too often.
Initially, I wasn't actually planning to do deltas for the compression technique, and it was in testing with a whole bunch of common sites that I stumbled upon the fact that they don't change very much. That led me to wonder about the algorithms that might be used.
I did test quite a lot of stuff (and at one point thought I'd come up with a truly cool new algorithm only to realize that I was mistaken :-) to decide what to do.
Railgun has to trade off three things: compression efficiency, space and time. Because we are trying to do this for performance, time is the most important thing to optimize for, followed by efficiency, followed by space. bsdiff is very, very good at delta compressing binary things; Railgun isn't as good, but it's very, very fast.
Out of curiosity, can you say anything about the algorithm you are using?
A year or two ago I got quite interested in delta compression, read all the papers I could find on the topic, and eventually came up with an algorithm that seems pretty competitive, although I've mostly focussed on efficient compression rather than speed. Someday I'll get around to porting the code from Python to C and find out what the performance is really like.
Could be very cool. I couldn't get through the article because it read like a press release. Maybe if someone who hasn't been spoon-fed the story reports on it, I'll take notice.
I don't know why you're being downvoted, the article is written pretty shittily. The article is mostly just quotes from jgc and the CEO and some filler by the writer.
Also the assertion that "It has already cut the bandwidth used by 4Chan and Imgur by half" sounds disingenuous and possibly not backed up by moot's quote “We've seen a ~50% reduction in backend transfer for our HTML pages (transfer between our servers and CloudFlare's),”. Is backend transfer for HTML pages the only bandwidth they're using? Is the rest of it halved, and if so, how and why?
Presuming this is RFC 3229, this is transport compression, not webserver offload.
The response is generated by the origin webserver as normal. But rather than sending that response using the normal HTTP encoding, instead the proxy first does a binary diff against any versions that the (CloudFlare) client says it has and that the (CloudFlare) proxy also has in its cache. They use e.g. ETags or MD5 to uniquely identify the entire response content.
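A sketch of that identification step, assuming a plain MD5 content hash as the version key (the cache layout is invented for the example):

    package versions

    import (
        "crypto/md5"
        "encoding/hex"
    )

    // cache maps a content hash to the full response body it names, so
    // "compress against version X" is unambiguous on both ends.
    var cache = map[string][]byte{}

    // versionKey computes the identifier for a response body.
    func versionKey(body []byte) string {
        sum := md5.Sum(body)
        return hex.EncodeToString(sum[:])
    }

    // remember stores a body under its hash and returns the key.
    func remember(body []byte) string {
        key := versionKey(body)
        cache[key] = body
        return key
    }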
You can still do cookie stripping etc to try to avoid the request to the webserver altogether, but that's a separate concern.
There isn't a per-site cache in Railgun because it's part of our large shared in-memory cache in our infrastructure.
Currently, cookies are not part of the hash.
We have customers of all types using Railgun. As an example, there's a British luggage manufacturer who launched a US e-commerce site last month. They are using it to help alleviate the cross-Atlantic latency. At the same time they see high compression levels as the site boilerplate does not change from person to person viewing the site.
What sort of sites do you think it doesn't apply to?
> What sort of sites do you think it doesn't apply to?
Single page webapps. In those cases the html/js is normally static and already CDN'ed and the data is a JSON API which varies on a per user basis.
There would be some gain as the dictionary would learn the JSON keys but I doubt it would be very dramatic vs deflate compared to the content sites referenced in the article.
Yes. That's up to the particular configuration of the site. It varies from site to site, but for optimal results you want it big enough to keep the content of the common pages of your site.
I'm curious about the crypto part. Could anybody explain to me, if it's an HTTPS link, where the SSL encryption happens? Does the Railgun listener talk with the origin server over HTTP or HTTPS?
If it's HTTP, then how does the CDN handle certificates? Does it use the CDN's certificates?
If it's HTTPS, then 1) isn't the hash going to be a lot different even if the two versions are very alike? 2) why does Railgun encrypt the encrypted data again?
The link between CloudFlare and the customer network (i.e. between the two bits of Railgun) is TLS. We have an automated way of provisioning and distributing the certificates necessary for that part.
For the connection from Railgun to the origin server it will depend on the protocol of the actual request being handled. If HTTPS Railgun makes an HTTPS connection to the origin.
The change detection algorithm is clever. But this is a classic memory vs. processor trade-off. The real trick here is that the Railgun service instantly adds massive amounts of cache to your service; it just so happens that - if their claims aren't inflated - adding these additional resources to your service is transparent. This has nothing to do with Railgun being developed in Go.
Other than general traffic data compression, I've always been somewhat interested in html compression in particular.
I know lots of webservers zip their response data, but I was always curious about the things in html that show up very often and if there's a way to optimize around that.
For example, most web xml data contains a lot of common tags, like "div" and "span" and others that are specific to html. I think if you add them up, they might make up a considerable percent of traffic data. Is it possible for the web server to swap those out for a single character before it sends the data, and have the browser replace it when it arrives?
Zip will put the common tags (like "div") into the compression dictionary once, then emit a short code every time the tag appears (more or less - it might take less than a single byte if it's a really common tag). So there's a bit of wasted overhead in rebuilding that dictionary of common tags for every response.
It would be more efficient if browsers and compression algorithms could agree (beforehand) on a dictionary of common terms likely to appear in the document.
Of course, my answer on Stack Overflow is pretty crude. You could create a dictionary used to compress the compression dictionary. Google will probably do this soon (if they haven't already), since they control the client (Chrome), server (Google web server) and protocol (SPDY).
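Go's standard library can already express the preset-dictionary half of this idea; a small sketch (the dictionary contents below are just an illustration):

    package main

    import (
        "bytes"
        "compress/flate"
        "fmt"
        "io"
    )

    // A shared dictionary of tokens likely to appear in HTML. Both sides
    // must use exactly the same bytes.
    var htmlDict = []byte(`<div class="</div><span></span><a href="</a>`)

    func main() {
        page := []byte(`<div class="post"><span>hello</span><a href="/x">x</a></div>`)

        var compressed bytes.Buffer
        w, _ := flate.NewWriterDict(&compressed, flate.BestCompression, htmlDict)
        w.Write(page)
        w.Close()
        compressedSize := compressed.Len()

        // The reader has to be primed with the same dictionary.
        r := flate.NewReaderDict(&compressed, htmlDict)
        out, _ := io.ReadAll(r)
        r.Close()

        fmt.Printf("original %d bytes, compressed %d, roundtrip ok: %v\n",
            len(page), compressedSize, bytes.Equal(out, page))
    }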
This is the third time this week that I've read or heard about Communicating Sequential Processes (CSP), the formal programming language devised by Sir Tony Hoare.
Third time's a charm. Definitely going to have to investigate.
> Today, [cloud providers Amazon Web Services and Rackspace, and thirty of the world’s biggest Web hosting companies] announced that they will support Railgun...
I can't find any such announcements; anybody have links? Based on comments further down, I wonder if the author is confused.
> CloudFlare will provide software images for Amazon and RackSpace customers to install
That is very different from the claim in the first paragraph.
Amazon and Rackspace customers need to install the software themselves (for now). The other listed hosts have made it one-click simple without the customer having to install anything. A couple announcements from major hosts today:
Uh-oh. A binary protocol. The problem being, it's actually bringing financial advantages over HTTP.
HTTP has the advantage of being standard, simple, plain text and thus easy to work with.
Hopefully HTTP 2.0 will attempt to solve this... erm...
The article mentions how this compression technique is similar to image compression. Would anyone care to explain, in detail if necessary, how this is so? Thanks.
It's not magic. It's threads. Go multiplexes your goroutines onto N OS threads. There are also abstractions in C/C++ (though of course as libs, not part of the language, like in Go) which hide the usage of threads. But there is no magic. If your code is running in parallel, your code is using OS threads.
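A toy illustration of that multiplexing (the numbers are arbitrary):

    package main

    import (
        "fmt"
        "runtime"
        "sync"
    )

    func main() {
        runtime.GOMAXPROCS(4) // at most 4 OS threads run Go code at once

        var wg sync.WaitGroup
        for i := 0; i < 10000; i++ { // far more goroutines than threads
            wg.Add(1)
            go func(n int) {
                defer wg.Done()
                _ = n * n // stand-in for real work
            }(i)
        }
        wg.Wait()
        fmt.Println("done with GOMAXPROCS =", runtime.GOMAXPROCS(0))
    }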
I was just looking into what SDCH is (an accept-encoding option from Chrome) and it sounds very, very similar: It generates a dictionary and then uses VCDIFF between requests. Is this related somehow?
Vaguely. Both Railgun and SDCH work by compressing web pages against an external dictionary. In SDCH the dictionary must be generated (somehow), and it is intended for use between a web server and a browser. Railgun is a back end for our network and automatically generates dictionaries.
Is anyone aware of a performance analysis between SDCH and one of the dynamic compressions like deflate?
I google, but all I find is people complaining their proxy/filter/appliance/diagnostic is breaking because it doesn't understand SDCH.
It seems like SDCH has been around for 4 years, I presume the lack of data means it hasn't worked out.
(I imagine that you could drastically reduce the CPU load of compression by making simple hard coded state machines for each dictionary. For content like XML or json you could easily make your field names and surrounding punctuation minimal. For many very short messages sharing a dictionary that would beat deflate on compression ratio, and for long messages of non-repeating field values it wouldn't be much worse. CPU use of expansion is probably comparable, though you might get better memory access behavior out of SDCH.)
What you describe is exactly what I've been looking for. There are remarkably few resources on this.
We have users in Singapore who access various XML-heavy web services in our NY office. A dictionary-style over-multiple-requests compression technique would be brilliant for their case.
Take a look at the various WAN accelerator appliances (Cisco, Silverpeak, Riverbed). They do almost exactly what it sounds like you want (if I'm remembering back to my evaluations, Cisco at least uses a multi-request dictionary for their compression)
Riverbed looks ideal but aren't they incredibly expensive? We looked at it years ago and I believe the necessary endpoints in our data center and in Singapore pushed past $140,000.
Riverbed (and all of the players in this space, really) are quite expensive, but this is where you get into the whole "Total ROI" argument for justifying the purchase.
Most companies depreciate hardware over 3 years. How much WAN/Internet bandwidth will you NOT use over the next 3 years, and how does that translate into upgrades you won't need to make?
There are also arguments for these boxes along the lines of "right now we use really expensive WAN links, but these boxes do end-to-end encryption too, so we can put the traffic on the Internet instead" but that opens up a few obvious cans of worms (and can of course be done without an accelerator with VPNs and whatnot).
Then you get into the more nebulous arguments that big bosses tend to like, such as "The average user makes Y XML requests per day to process X widgets. Each request takes Q seconds now. If we lower that to Q*0.5 with WAN acceleration, each user can now process N more widgets per day". Fluffy argument, but can have a big impact on business decision makers, especially if you can tie it to a dollar amount.
Note that WAN Accelerator salespeople are really, really good at coming up with arguments like this for/with you during the sales process.
Railgun is used between the CDN and the HTTP server, while this one seems to be between the browser and the HTTP server.
Railgun only requires the website to deploy a client and cooperate with CloudFlare. The user's client doesn't have to be Chrome or whatever, and the web server doesn't need to be aware of Railgun. It's transparent to both HTTP clients and HTTP servers. SDCH, however, requires a modification to the HTTP/1.1 protocol, which implies changes in both HTTP clients and HTTP servers.
Both are quite promising, though Railgun is easier to adopt.