How do you take a server out of rotation if one goes down? For example, if I hit a server in the round robin that isn't online, I could try force-refreshing the page to initiate a new lookup, but in that case 1 in every 3 requests to the site would still fail (assuming only one server is down, and hoping my client doesn't decide to cache the initial resolution). In your post you mentioned a browser can handle this transparently - did you mean that if the client sees the domain has multiple A records and the initial connection fails on one of the IPs, it will automatically try the next IP in the round robin? Is this a browser standard, or is this behaviour handled differently between browsers?
Either way interesting project, thanks for sharing.
EDIT: Also it'd be cool if the code wasn't in a tarball. I'm on a mobile device (as I'm sure many of your users are too) that doesn't allow me to save/extract the archive. Would've liked to have had a browse through it! Maybe consider uploading to a service like GitHub, or having an extracted version available so we can view the contents directly in a browser? :)
I don't have to do anything to take a server out of rotation. Browsers automatically try all IPs, and then stick with the first IP they find that works. Even if some servers are down, your HTTP requests continue to all work. This is a standard browser behavior. For a really extensive outage (2+ days), I would probably bother to manually update the DNS records.
Thanks for the feedback about making the code browsable. Will consider.
The browser would time out after a few minutes. However if the user stops loading the page and hits Reload, the browser will try another IP and the page will load successfully. I verified this behavior with Chrome. For a personal blog, that's definitely "good enough" HA.
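To make it concrete, here is a minimal sketch of that fallback logic at the socket level (not Chrome's actual algorithm, just the same idea: resolve every address for the host and try each one until a TCP connection succeeds; the hostname and timeout are placeholders):

```
import socket

def connect_any(host, port=80, timeout=3):
    """Try every address returned for host until one accepts a TCP connection."""
    last_error = None
    # getaddrinfo returns all the A/AAAA records, the same list a browser gets
    for family, socktype, proto, _, sockaddr in socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP):
        try:
            sock = socket.socket(family, socktype, proto)
            sock.settimeout(timeout)
            sock.connect(sockaddr)
            return sock               # first server that answers wins
        except OSError as exc:
            last_error = exc          # unreachable or powered-off server: try the next IP
    raise last_error or OSError("no addresses resolved")

# sock = connect_any("example.com")   # substitute the blog's hostname
```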
When you think back on it, this was the true genius behind RSS aggregators. I could still read your blog when your blog was not working, because Google Reader (or your favorite aggregator) had already downloaded it once. It's too bad that RSS died.
True. However this specific type of failure should be relatively rare. I expect most of my outages to be either network issues or a powered off machine (TCP timeout) or the web service is not running (TCP RST). Accepting a connection and not replying is just rare for something as dumb as an HTTP request for a static file.
I just use hugo for site generation, isso for self hosted comments, rsync for dumping files on the server and postcss for all the styling. All the scripts are in a package.json so I can download and run everything with an npm run build-init. I don't think most people's personal blogs need to worry about high availability, multiple servers and the complexity that goes along with it, but I'm sure some do. I simply don't have that traffic.
The orange line is neat; I opted for a CSS avatar replacement since isso tags the generated avatars with an id I can grab. The two columns I'd have to get used to; maybe I'm just easily distracted.
I'm not a big fan of the design though. I find the visual hierarchy of the layout a bit confusing:
- There isn't enough whitespace around the content, making it feel cramped
- Scrolling isn't a bad thing. I usually find a top-down flow (from title to content) easier to navigate than reading down from the title to the byline and date, then back up for the content
- Comments and the comment form below the content make sense, since people usually like to read and post comments after they've read the blog post. Having them at the top forces users to scroll back up again.
After the DNS resolves, the IP will be cached by the client for hours. So if any server goes down, the blog goes down with it for those users. If you can't reboot any server without causing downtime, I don't think it can be called high availability.
Taking a simple Jekyll blog on a single instance and adding Cloudflare would work much more reliably, and it doesn't require all sorts of rsync tricks.
"After the DNS resolves the IP will be cached by the client for hours"
Not true, because my TTL is low (5 min). Chrome will even try another IP within tens of seconds if the initial IP becomes unresponsive. I tested this on Linux. That said, I recognize this is not a production-quality mechanism for implementing HA (large sites and CDN providers use anycast, load balancers, etc.). But for a personal blog, this is perfectly fine :)
I tested this pretty extensively. When I switched IPs, 99.9% of the HTTP traffic hit the new servers within the TTL time (5min). The only residual traffic hitting the old IPs I saw was from poorly configured bots & webspiders who probably don't re-resolve hostnames frequently enough.
I also remember reading a post from Amazon Route 53 engineers who investigated DNS propagation times in large-scale tests around the world, and my observations aligned with theirs. They concluded the "DNS doesn't propagate according to TTL" story is mostly a myth (modulo rare issues here and there).
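If you want to check this from the resolver side yourself, here is a quick sketch using the dnspython package (just for illustration, not necessarily what I used) that prints the A records and the TTL your resolver currently reports for a domain:

```
import dns.resolver   # pip install dnspython

def show_records(hostname):
    """Print each A record and the TTL the local resolver currently reports."""
    answer = dns.resolver.resolve(hostname, "A")
    print(f"TTL: {answer.rrset.ttl}s")    # a low TTL (e.g. 300s) bounds how long clients cache
    for record in answer:
        print(f"  A {record.address}")

# show_records("example.com")   # substitute the domain you want to watch
```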
We use DNS for rotations at work. When we make a change, traffic starts moving right away, and most of the traffic moves within two or three TTLs. Most browsers do have some compensation mechanisms when some servers in the list don't respond, and will do even better if some servers are refusing connections. There are lingering connections for a long time though -- I think some DNS servers do cache until they are restarted.
The load balancers we have available to us had worse uptime and scalability than our servers, so DNS it is. It would be nice to fully drain servers in a reasonable amount of time, but I can deal with it.
Indeed, for example Chrome wants to cache DNS results for at least a minute [1].
However, Chrome and Firefox cache the complete DNS/getaddrinfo response, i.e. all addresses. Therefore a server taken out of rotation will be quickly identified when a socket connection fails.
This can't work. The author's requirement was to have self-hosted comments without JavaScript. So his HTML pages change every few minutes, and Cloudflare is just not designed for this.
You don't even need the instance if it's a static site that can be served directly off S3. With S3 + Cloudflare for a Jekyll blog, you can handle peaks of thousands of hits a second without noticing it, at something like $10/month.
> How well do S3 and Cloudflare handle a page changing potentially every 2-3 minutes? I was under the impression S3 is not made for this use case.
It isn't made for that use case...the CDN provider I use takes ~15 minutes to update its Push Zone globally.
However, a pair of $5 VMs and loading comments via JS "solves" the need for dynamic updates since comments are the only thing likely to change frequently. Personally, I don't have much need for hosted comments so I don't bother.
From the deterministic comment IDs section it seems like the design allows a commenter to repeatedly update the timestamp of a comment they authored, which might be amusing (rough sketch after the list):
- Remember the "seed" value and submit a comment
- Wait for a reply to the comment and then submit the same comment again with the same seed
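I haven't read the code, so the following is purely a guess at how a deterministic-ID scheme could allow this: if the comment ID is derived only from the client-supplied seed, resubmitting with the same seed maps to the same slot and refreshes its timestamp. All names here are hypothetical.

```
import hashlib
import time

comments = {}  # pretend datastore: comment_id -> (text, timestamp)

def submit_comment(seed: str, text: str) -> str:
    # Hypothetical scheme: the ID depends only on the client-supplied seed,
    # so the same seed always maps to the same comment slot.
    comment_id = hashlib.sha256(seed.encode()).hexdigest()[:12]
    comments[comment_id] = (text, time.time())   # resubmitting bumps the timestamp
    return comment_id

first = submit_comment("my-seed", "hello")
# ...someone replies, time passes...
second = submit_comment("my-seed", "hello")       # same seed, same ID, newer timestamp
assert first == second
```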
Great read, and serving 2500 hits/sec for $15 is awesome! Even better: on DigitalOcean you could keep a snapshot of your server and, if needed, deploy another node in minutes.
Off-topic: how does rebuilding the blog's menu/index work with Jekyll? Say you add one entry under home/foo/bar and you have hundreds of pages. Will it actually re-create hundreds of HTML pages to update the navigation? Or does it use some sort of HTML include/JS magic?
That's peak...I doubt he'd still be paying $15/mo if he was sustaining that kind of traffic. Even as lean as his site is (looks like it's a bit over a kilobyte being served from his domain...wow!), 2.5 MB/sec is around 6.5 TB/mo, which would cost another ~$100/mo from DO.
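Back-of-the-envelope, in case anyone wants to check the numbers (assuming roughly 1 KB per page view, per above):

```
# Rough check: ~1 KB per page view at a sustained 2500 requests/sec
bytes_per_month = 2500 * 1024 * 60 * 60 * 24 * 30
print(f"{bytes_per_month / 1e12:.1f} TB/month")   # ~6.6 TB, in line with the ~6.5 TB figure
```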
However, unless you have some desire to roll your own setup and maintain/admin the servers, I just don't see the rationale for hosting your own statically-generated site. GH Pages is free. And if you need more than what it offers, Netlify's $7.50/mo plan will be fine for more than 90% of these kinds of blogs and has many advantages: no servers to maintain or worry about going down, use of Akamai's network for faster content delivery, and automatic site builds from source code. And all for less than a 2-server hosted setup. And there are also Surge, Forge and a few other competitors in that space, so it's not like you'd get locked into a single vendor.
The main rationale for hosting my own site is that I do not want to depend on a single provider, ever. Using GH, or Akamai, or Cloudflare, etc means putting all your eggs in one basket. And outages do happen at these companies.
I argue that the statistical chance that 3 outages occur at the same moment at my 3 different providers on 3 different continents is less than the chance of a single outage at a single CDN. Not for technical reasons (after all, they are extremely redundant), but for process reasons: a CDN is a single company with company-wide processes, and sometimes human errors happen that cause company-wide outages.
I do realize it sounds silly to be as obsessed with redundancy as I am :) but I love exploring these sorts of ideas to see how well they work in practice.
> I argue that the statistical chance that 3 outages occur at the same moment at my 3 different providers on 3 different continents is less than the chance of a single outage at a single CDN.
I'm not sure I'd agree with that. You're probably running a pretty similar software stack on each box, so there's a chance that a software update could take down all three or a vulnerability could put all three at risk. I also have a hard time believing that any one cloud virtual server has many 9s uptime, so 3 together isn't likely to get you to Akamai's level.
But I prefer to rely on the large players simply because there are advantages to staying with the herd. If Akamai has an outage, much of the internet will be down; people will see that your site is down and attribute it to a larger problem. If your 3 boxes or your DNS provider have a problem, it will be much more obvious to users that the problem is yours.
Also, uptime is not just a function of the percentage chance of an outage. The time to recovery also matters. What kind of monitoring do you have in place? How long are your outages likely to last vs those of a large player? If it takes you 24 hours to recognize a failed server, then the chances of all 3 failing in a 24 hour period are probably a lot higher than a big player having an outage.
There's a lot that goes into high availability that you just don't get by putting your eggs in three small baskets.
"I also have a hard time believing that any one cloud virtual server has many 9s uptime, so 3 together isn't likely to get you to Akamai's level."
Actually, even if the servers are only 98% available (7 days of downtime per year!), together they provide 99.9992% availability, aka "five 9s": 1-((1-.98)^3) = 0.999992
Of course this statistical result assumes downtimes are random and uncorrelated. As you say, the risk is that a common bug or vulnerability affects all 3 of them at the same time. Hopefully this risk is mitigated because my software stack is extremely simple (a web server serving static files + ~400 lines of Python code) and I will not be performing software updates on all 3 servers at the same time.
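For reference, the arithmetic as a quick sketch (assuming, as I said, that failures are independent):

```
# Availability of n independently failing servers, any one of which can serve the site
def combined_availability(per_server, n):
    return 1 - (1 - per_server) ** n

print(combined_availability(0.98, 3))   # 0.999992 -> "five 9s"
```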
"or your DNS provider has a problem"
I do not use a provider, but run 3 parallel authoritative DNS servers on these 3 servers :)
It depends on the Jekyll template. With mine the list of posts is only displayed in the top-level index.html, so only this file needs to be updated when adding a post.
Looks great at 1230px wide, but 1231 is too cramped. I don't want to write a new comment before reading the article, and then there's a big blank space that takes up 40% of my viewport when the article continues past the comments.
I'd suggest rethinking that breakpoint or doing away with it altogether; centering may feel like a waste of space, but it makes for comfortable reading and your large images can still take up the full width.