The most frustrating part of things like this, and that Snapchat issue, is that the largest abusers could probably run their own NTP servers and swallow the cost as a rounding error on their bottom line.
Once you put an NTP server on the 'net, it's public - pretty much like most Web sites. Sure, there are reasonable expectations of decency like for anything in the Commons, but I don't think there's any legal defense against skunks at the picnic.
IIRC, the university called Netgear out for doing something stupid and disruptive, and Netgear stopped doing it. The second best possible scenario, I guess.
Netgear issued patches for the devices. Most people never update their router firmware, and we're talking about over 700,000 devices. The university still gets considerable traffic.
Throttling won't do much good; their WAN interfaces will still eat the traffic. I don't think the issue was so much that the NTP servers were melting as that their entire network was.
For throttling to have any effect, the traffic would have to be sinkholed/throttled upstream before it ever reaches them, and as a university they are effectively their own ISP, so that might not even be possible.
I'd be interested in seeing if there's been any update since 2003.
E.g., is it really "considerable traffic" by 2016 standards? The original flood in 2003 was 150 Mbps - I don't think I'd notice if I got a flood of 150 Mbps on my home connection.
How many of those devices are still around 13 years later?
Wow, I never realized operators couldn't push fixes to their routers without permission. The internet is indeed a tragedy of the commons: trivial to ruin, but a Sisyphean task to fix.
Most admins would consider having their network infrastructure's firmware change outside of their control a bug/misfeature. Not to mention that most devices would require a reboot to apply the change.
And being able to remotely change the running code is a HUGE security issue.
The vast majority of admins don't even know that they're admins. They bought or received a cheap Netgear router, plugged it in, and never touched it again, except to maybe turn it off and on again when the internet was slow/down.
If you're an admin who cares about their infrastructure, you're not using a bargain-basement Netgear router, and if you are, you'll have gone through every single menu and seen the auto-update option.
Some operators do, mostly ISPs that lease routers to customers and retain a way to push firmware updates to them (for example, Comcast does this). But router manufacturers typically don't touch the device once it's out of their hands.
Note that cable modems (all of them, not just from Comcast) download their configuration from the provider every time they boot up. Ironically (since it uses TFTP, for one), this is called "secure provisioning".
They might give you a web interface where you can configure certain settings (e.g. integrated Wi-Fi) but the ISP ultimately has at least some control over any cable modem connected to it.
I wonder if they somehow mistakenly joined their server to the region-specific pr.pool.ntp.org group[1]. At the moment, that pool exists, but has no servers in it.
So, if you were the only server in the pool, perhaps you would get a lot of Puerto Rican traffic?
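For anyone curious, the zone is easy to query directly; I'm not sure offhand whether an empty country zone returns nothing or falls back to the continent zone, but the answer would be telling:

    dig +short pr.pool.ntp.org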
Someone else commented on the article, but it's probably related to Puerto Rico ISPs using NAT because of a lack of IPv4 address space. That single IP is probably many, many people.
That one doesn't make sense to me. Most people don't have NTP configured to point at pool.ntp.org; they're mostly PCs pointed at time.windows.com. And the pool is big enough anyway that it would spread the load from a relatively small island pretty well. NAT could be a small part of it, but there's a different primary cause.
It doesn't necessarily need to be Windows making the calls. Cell phones use a NAT typically and there was recently an issue with Snapchat DDoSing NATs.
> I wonder if Puerto Rico has run out of its pool of IPv4 addresses. After Europe and Asia, and just this month Latin America as well, have exhausted their IPv4 pools, many local ISPs have resorted to NAT to deal with the scarcity of addresses (after years of procrastinating on IPv6 and pretending this day would never come, of course). Given that the source is a Puerto Rican ISP, and one of the offending addresses is from a small /21 network, it's possible that NAT is to blame. As ISP NAT becomes more prevalent, abuse is going to get rather touchy to deal with: is it one abuser, or several innocent users behind a NAT?
I'm not understanding why NAT would cause it. I could see something like a misconfigured forwarding DNS cache causing it, where it only queries pool.ntp.org once and keeps returning the result in the same order (with Pivotal's IP at the top of the list) to a large number of querying clients. Then, perhaps, there are a bunch of NATted clients behind one IP? NAT on its own, without some other contributing factor, shouldn't cause this.
NAT wouldn't cause it, but it would hide the fact that those are many clients all sharing the same source IP. Of course, that wouldn't explain why they observed a general increase in traffic.
I wish he'd explained somewhere how they leapt to examining virtualized NTP clients, or what they ultimately did (since there's no part 3 that I can find).
> I wish he'd explained somewhere how they leapt to examining virtualized NTP clients...
I had a hunch [wrongly] that the traffic was caused by a particular operating system. I didn't have enough machines to run the tests on bare-metal, so I virtualized them. And I suspected that virtualization would provide a worst-case scenario (the virtualized clocks would be jittery).
My big surprise was that Windows was a model client (once per day), OS X was good, too, and that FreeBSD was the worst and Ubuntu a close second. It was a complete inversion of what I had expected to find.
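If anyone wants to reproduce that kind of comparison, you don't even need to read the client's config; just watch the wire from the hypervisor or the VM itself (interface name is whatever yours is) and note the spacing between requests:

    # each outbound packet on UDP/123 is one poll; the gap between them
    # is the client's effective poll interval
    tcpdump -ni eth0 'udp and dst port 123'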
> what they ultimately did (since there's no part 3 that I can find).
I never wrote part 3, but now that it seems the post has gained traction on Hacker News I might be inspired to write one. The summary would be as follows:
To reduce your costs by a quarter, use the following lines in your NTP configuration file to throttle overly-aggressive clients (the important directive is `limited`):
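From memory, so treat the exact thresholds as illustrative rather than the precise values we run; tune `discard` to your own traffic:

    # rate-limit abusive clients; "limited" drops queries from clients that
    # poll too often, and "kod" additionally sends a kiss-o'-death packet
    # asking them to back off
    discard average 3 minimum 2
    restrict default kod limited nomodify notrap nopeer noquery
    restrict 127.0.0.1
    restrict ::1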
I have never understood why people think kod is a useful setting. Why do you think a misbehaving/improperly configured client is going to honor the kod packet? The kod packet helps with some clients, but I have never seen it change the behavior of the most egregious abusers. Just ignore future requests from misbehaving clients; there is not a lot of benefit in saying "please stop misbehaving" to a client that does not follow the spec.
Your configuration is also missing a couple of big best practices. The most glaring is that you really need to add 'iburst' to your server stanzas. After that, you should think about adjusting minsane and minclock.
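Roughly, something like this (hostnames are placeholders; pick tos values that match how many sources you actually list):

    # iburst speeds up initial synchronization without raising steady-state load
    server time1.example.net iburst
    server time2.example.net iburst
    server time3.example.net iburst
    server time4.example.net iburst

    # don't trust/select the clock until enough sources agree
    tos minsane 2 minclock 3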
It's not strictly harmful, I think, to presume that some percentage of misbehaving clients might just be misconfigured and honor a KoD, as long as you have other measures, unless I'm overlooking something?
If an NTP client is already misbehaving and/or misconfigured to the point where it's considered "abusive", what are the chances the client will do The Right Thing(TM) when it receives a "kiss of death" ("kod") packet from the NTP server?
Enable KOD, by all means, but you may also consider putting in some (high) per-IP rate limiting for 123/UDP in your firewall rules as a backup plan (for if/when clients ignore kod).
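Something like this, assuming Linux/iptables with the hashlimit match available; the numbers are placeholders and should sit well above anything a sane client would ever send:

    # per-source-IP ceiling on inbound NTP queries
    iptables -A INPUT -p udp --dport 123 \
        -m hashlimit --hashlimit-name ntp --hashlimit-mode srcip \
        --hashlimit-above 10/second --hashlimit-burst 20 \
        -j DROP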
Virtualization and time sync have had notorious problems. One ugly workaround was frequent NTP polling and adjustments. NTP has a minimum and maximum poll interval, and it automatically decides how often to poll based on how much drift it sees. If the clock drifts quickly, it will gravitate to the minpoll value, which is exactly what their first graph shows: tons of polling at the minimum interval for certain hypervisors.
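For reference, those bounds are set per server line and are exponents of two (6 = 64 s, 10 = 1024 s); a guest with a jittery clock will park itself at the low end (hostname hypothetical):

    server time.example.net iburst minpoll 6 maxpoll 10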
I've found it good practice to run one (or a few) stratum 2 node(s) to serve local resources. It's usually more important that these resources be in sync with each other than with a satellite, and that's helped by having as few machines as possible trying their luck over the internet against bogged-down public stratum 1 sources; instead, configure everything else to use a single source of time from a box nearby.
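In ntp.conf terms it's nothing fancy; the names here are hypothetical:

    # on the internal stratum 2 box: a handful of upstream sources
    server time1.example.net iburst
    server time2.example.net iburst
    server time3.example.net iburst

    # on every other machine: only the internal box
    server ntp.internal.example.com iburst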
Most corporate environments have long defaulted to having a stratum 2 NTP server, at least if they use Windows, since Kerberos requires clocks to be roughly in sync.
With the advent of the cloud I see a lot of businesses forgoing any central IdM solution like AD or IPA, and as a result they don't tend to have an NTP server unless they explicitly configured one.
All of my servers are joined to an AD or IPA domain, so they all use my local NTP servers by default.
Hats off to everyone contributing to public services like this.
My then-company wanted to give back by doing this many years ago, and it was an eye-opening experience. We had trouble almost immediately with utilization and script kiddies. The company ended up only doing it for a relatively short period and made contributions to projects instead.
Honestly, that's the university's fault then. Properly configured, it should've had very little noticeable effect on the firewall (e.g. "permit udp any host 10.11.12.13 eq 123"), as there's no need to do any inspection or state tracking ...
... unless they saturated the available bandwidth but, really, that's a different issue (although also preventable!).
I think this is a great walk-through of the analysis. I too experienced a huge spike in NTP traffic in 2014, but it was because of people exploiting NTP for reflection attacks to DDoS other parties. That forced me to use a GPS module and a BeagleBone Black as an internal time server (which has been great).
I put the GPS antenna on my window sill, it has no problem at all staying locked. My plan had been to stick it outside the window but turned out not to be necessary.
I've also used the Adafruit module, although with a Raspberry Pi. FWIW, the PPS signal seemed to be a little off compared to another receiver (a Garmin GPS 18x LVC). I never tried to pinpoint the issue, but I strongly suspected the Pi. No noticeable issues with the antenna inside on a window sill (should be fine as long as you can see four satellites).
I haven't tried it myself but I've heard of several other good experiences w/ the BeagleBone Black. The Garmin seemed to work the best for me, although it is a little more expensive. I was strongly considering putting a few of them in $work's (private) facilities as a fun, nerdy project but I never got around to actually doing it. The Garmin with a BBB might very well be a great combination for that.
One other thing: make sure you use a "real" serial (or parallel) port -- not a USB to serial adapter!
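For anyone wiring one of these up, the ntpd side is short. This is roughly the shape of mine; the mode, fudge values, and device paths are exactly the bits you'll need to adjust for your own receiver and kernel:

    # GPS_NMEA refclock (driver type 20), unit 0 reads NMEA from /dev/gps0
    # (usually a symlink to the real serial port); flag1 1 enables PPS
    # processing, time2 fudges the serial-sentence delay, refid labels the source
    server 127.127.20.0 mode 0 minpoll 4 maxpoll 4 prefer
    fudge  127.127.20.0 flag1 1 time2 0.350 refid GPS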
Digital Ocean is a great deal! Thanks for pointing that out.
The reason I use {aws,azure,google} to host my NTP servers is that my day job is developing a VM orchestrator (BOSH) for Cloud Foundry, and BOSH doesn't support Digital Ocean yet (AFAIK). But that's a personal choice, and an admittedly expensive one.
Why not just use the public pool? A virtualized NTP server isn't the ideal scenario and there are plenty of "not quite 100% public" NTP servers you could use as well (i.e. you first have to send an e-mail to get access).
Funnily enough, last time I had NTP troubles, it was on a BOSH deployment. BOSH wants to use ntpdate to keep time (because real NTP is too normal?) and we'd overzealously configured the security groups around our NAT setup so the machines couldn't reach their NTP server. Whoops.
It'd be nice if BOSH could have detected this problem and warned us about it, ideally during deployment. But it'd be even nicer if we didn't suck at configuring AWS. If you could fix either of those things, that would be great!
It also runs on AWS, so even though they try to trick you into thinking you don't pay for bandwidth, you do. Also, small EC2 instances are throttled, so their server would probably not function well most of the time.
Literally any of the other major providers, like Linode or Digital Ocean, give you much more reasonably priced bandwidth. You don't get all the nice AWS tools, but if all you use is EC2 or S3, you can still use Vagrant or Terraform plus some sort of configuration management (Ansible, Puppet, etc.) to provision servers programmatically.
Here's my Hetzner VM (Germany). +/- 10 milliseconds, though I can't help but suspect the distance from the monitoring station (Los Angeles) may have more to do with it than being a VM:
Everyone's needs differ, I suppose, so some might consider that "decent". 10ms -- or even 50ms -- might be acceptable for many (most?) use cases but not for me.
From a quick look, my own (stratum 2) server in the pool currently has an offset of just under 1/20th of one millisecond.
For another data point, here's my Digital Ocean droplet in Bangalore which sees from 10-60k queries per second depending on the time of day:
http://www.pool.ntp.org/user/bradfa
I had the same thought. Does the world really need another (presumably) stratum-3 server running in Amazon's cluster, when Amazon already runs a pool of stratum-2 servers? ([0-3].amazon.pool.ntp.org).
By my recent reading of the AWS docs, those pool addresses are not run by Amazon. They are DNS names that allow NTP load from AWS to the NTP pool to be distributed more fairly.
You're not supposed to run an NTP server on a VM. CPU cycles can be taken from the guest and used by the host or another guest. Another option is to drill a hole in the data center and run a GPS antenna to the roof to get the time from the GPS satellites, but I think that's something many VPS/cloud providers won't allow.
Because VMs themselves might not be able to keep track of time accurately (potentially inconsistent tick rate) the way a bare-metal setup can. That's why they should be mere consumers, i.e. sync their time to whatever the remote says rather than contribute to the pool.
Why would anyone run an authoritative time service on a virtual server in the first place? My experience is that system time suffers greatly from noisy neighbor.
I'm not sure if I've missed it, but is the question (from the title) ever answered? The discrepancy between expected traffic volume and actual traffic volume is huge and seemingly unexplained.
My bad — I never wrapped it up. Thanks to the HN interest, I'll try to write Part 3 over the winter break.
The short version is this: it's gonna cost a couple of hundred dollars to run a 1 GbE NTP server in pool.ntp.org, but you can tweak the ntp.conf to save ~$100.
This is the Snapchat bug reported yesterday, right?
Incidentally, how is AWS dealing with the leap second next week? Google is going to have their time servers run slightly slow starting around 20 minutes in advance of the leap second, so they're back in sync with UTC once the leap second has passed.
If memory serves, Google's "smoothing" the second out over (I think) a 24-hour period. I don't recall the exact time period off the top of my head but it's much, much longer than 20 minutes.
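Back-of-the-envelope, assuming a 24-hour window, which shows why a long smear beats a 20-minute one:

    1 s over 24 h   (86,400 s)  ->  1/86,400 ≈ 11.6 ppm slowdown
    1 s over 20 min (1,200 s)   ->  1/1,200  ≈ 833  ppm slowdown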
Note that inbound traffic, which was steady at ~4k packets/sec, spiked as high as five times that. Also note that the Snapchat traffic followed a circadian rhythm (much higher traffic during the daytime).
Article said no, because the traffic was symmetrical and not lopsided. If this had been part of an attack you'd expect to see far more outgoing bandwidth than incoming.