Why is my NTP server costing $500 per year? Part 1 (2014) (pivotal.io)
159 points by t0mas88 on Dec 24, 2016 | 109 comments



Reminds me of when Netgear decided to use the University of Wisconsin NTP servers as the default in their consumer products: http://pages.cs.wisc.edu/~plonka/netgear-sntp/


The most frustrating part of things like this, and that Snapchat issue, is that the largest abusers could probably swallow the cost of their own NTP server usage as a rounding error on their bottom line.


The university commitment to still serve the public is admirable.


How is that even legal?


Once you put an NTP server on the 'net, it's public - pretty much like most Web sites. Sure, there are reasonable expectations of decency like for anything in the Commons, but I don't think there's any legal defense against skunks at the picnic.

IIRC, the university called Netgear out for doing something stupid and disruptive, and Netgear stopped doing it. The second best possible scenario, I guess.


> Netgear stopped doing it.

Netgear issued patches for the devices. Most people never update their router firmware, and we're talking about over 700,000 devices. The university still gets considerable traffic.

https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse#NE...


Can't the university throttle non-local connections? (Netgear could have provided the hardware ;)


Throttling won't do much good; their WAN interfaces will still eat the traffic. I don't think the issue was so much that the NTP servers were melting; it's more that their entire network was.

For what you want to have any effect, they'd have to sinkhole/throttle the traffic upstream before it ever reaches them, and as a university they're effectively an ISP, so that might not even be possible.


Not necessary -- fortunately these routers all used source port 23457 for their NTP packets, making them trivially easy to block.
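
These would also be easy to drop with a stateless rule; e.g. on a Linux box (a sketch, just for illustration; the university's border gear would have its own syntax):

```
# Netgear's broken SNTP client always sent from UDP source port 23457
iptables -A INPUT -p udp --sport 23457 --dport 123 -j DROP
```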

I do recommend reading the incident report posted above if you have an interest in network operations, it's quite interesting!


You still have to receive the traffic to block it.


It is a public NTP server. Throttling the public defeats the point.


Not really. If an NTP client gets 1/10th of the updates that it wants, it will still keep reasonably good time.


Keeping a table of clients and their last-updated time is probably more expensive than just sending them a response.


Randomly dropping a percentage of requests is stateless.
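
E.g., on Linux (a sketch using the iptables statistic match; real border routers would use their own equivalent):

```
# randomly drop ~50% of inbound NTP requests; no per-client state is kept
iptables -A INPUT -p udp --dport 123 -m statistic --mode random --probability 0.5 -j DROP
```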


But what happens if you are the client whose requests are dropped all the time because you are unlucky?


Then I guess you'll have to use somebody else's free server.


Is the NTP synchronization done with only a single packet? Because otherwise you'll be interrupting connections constantly.


It is UDP, so it is connectionless.


UDP is a Transport Layer protocol. NTP is an Application Layer protocol.


Fair enough, UDP doesn't imply that the application protocol is connection-less, but AFAIK NTP is.


Do NTP clients not retry immediately/very quickly if a request gets lost?


It's more that your device's clocks don't skew that rapidly, unless something is really wrong.


But dropping requests to reduce traffic load is counter-productive if failed requests are quickly retried.


Retry yes, immediately/very quickly no. NTP is designed to handle network issues transparently.


This was one of the issues with Netgear's client: it retried every second until it worked.


Incorrect clients aside, it should at worst be another 64 seconds (from memory so I might be wrong) before a client retries a poll.


> The university still gets considerable traffic

I'd be interested in seeing if there's been any update since 2003.

E.g., is it really "considerable traffic" by 2016 standards? The original flood in 2003 was 150 Mbps - I don't think I'd notice if I got a flood of 150 Mbps on my home connection.

How many of those devices are still around 13 years later?


I believe an "agreement was forged" - no public details, but I'd assume some money changed hands.


$375,000.

https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse#NE...

> NETGEAR has donated $375,000 to the University of Wisconsin–Madison's Division of Information Technology for their help in identifying the flaw.

Although that appears to be uncited.


Wow, I never realized operators couldn't push fixes to their routers without permission. The internet is indeed a tragedy of the commons: trivial to ruin, but a Sisyphean task to fix.


Most admins would consider having their network infrastructure's firmware change outside of their control to be a bug/misfeature. Not to mention most devices would require a reboot to apply the change.

And being able to remotely change the running code is a HUGE security issue.


The vast majority of admins don't even know that they're admins. They bought or received a cheap Netgear router, plugged it in, and never touched it again, except to maybe turn it off and on again when the internet was slow/down.

If you're an admin who cares about their infrastructure, you're not using a bargain-basement Netgear router, and if you are, you'll have gone through every single menu and seen the auto-update option.


Sure. It's also why the internet is super vulnerable to 0-days.


Some operators do, mostly ISPs that lease routers to customers and retain a way to push firmware updates to them (for example, Comcast does this). But router manufacturers typically don't touch the device once it's out of their hands.


Note that cable modems (all of them, not just from Comcast) download their configuration from the provider every time they boot up. Ironically (since it uses TFTP, for one), this is called "secure provisioning".

They might give you a web interface where you can configure certain settings (e.g. integrated Wi-Fi) but the ISP ultimately has at least some control over any cable modem connected to it.


I wonder if they somehow mistakenly joined their server to the region-specific pr.pool.ntp.org group[1]. At the moment, that pool exists, but has no servers in it.

So, if you were the only server in the pool, perhaps you would get a lot of Puerto Rican traffic?

[1]http://www.pool.ntp.org/zone/pr


Someone else commented on the article, but it's probably related to Puerto Rico ISPs using NAT because of a lack of IPv4 address space. That single IP is probably many, many people.


That one doesn't make sense to me. Most people don't have NTP configured to point to pool.ntp.org... they are mostly PCs pointed at time.windows.com. And the pool is big enough anyway that it would spread the load from a relatively small island pretty well. NAT could be a small part of it, but there's a different primary cause.


It doesn't necessarily need to be Windows making the calls. Cell phones typically sit behind NAT, and there was recently an issue with Snapchat DDoSing NTP servers.


That's an example of a "different primary cause". It's not NAT in that case, it's an app using a library with terrible defaults.


An interesting theory from one of the comments:

> I wonder if Puerto Rico has run out of its pool of IPv4 addresses. After Europe and Asia, just this month Latin America as well, have exhausted their IPv4 pools, many local ISPs have resorted to using NAT to deal with the scarcity of addresses (of course, after years procrastinating IPv6 and pretending that this day wouldn't come about). Given that the source is a Puerto Rican ISP, and one of the offending addresses from a small /21 network, it's possible that NAT is to blame. As ISP NAT increasingly becomes more prevalent, this is going to be rather touchy to deal with abuses. For is it an abuser or just several innocent users behind a NAT?


I'm not understanding why NAT would cause it. I could see something like a misconfigured forwarding DNS cache causing it, where it only queries pool.ntp.org once and keeps returning the result in the same order (with Pivotal's IP at the top of the list) to a large number of querying clients. Then, perhaps, if there are a bunch of NATted clients behind one IP? NAT, on its own, without some other contributing factor, shouldn't cause this.


NAT wouldn't cause it, but it would hide the fact that those are many clients all sharing the same source IP. Of course, that wouldn't explain why they observed a general increase in traffic.



I wish he'd explained somewhere how they leapt to examining virtualized NTP clients, or what they ultimately did (since there's no part 3 that I can find).


[author]

> I wish he'd explained somewhere how they leapt to examining virtualized NTP clients...

I had a hunch [wrongly] that the traffic was caused by a particular operating system. I didn't have enough machines to run the tests on bare-metal, so I virtualized them. And I suspected that virtualization would provide a worst-case scenario (the virtualized clocks would be jittery).

My big surprise was that Windows was a model client (once per day), OS X was good, too, and that FreeBSD was the worst and Ubuntu a close second. It was a complete inversion of what I had expected to find.

> what they ultimately did (since there's no part 3 that I can find).

I never wrote part 3, but now that it seems the post has gained traction on Hacker News I might be inspired to write one. The summary would be as follows:

To reduce your costs by a quarter, use the following lines in your NTP configuration file to throttle overly-aggressive clients (the important directive is `limited`):

```
restrict default limited kod nomodify notrap nopeer
discard minimum 0
```

Here is a description of NTP rate-limiting and why `limited`, `kod`, and `discard` are important: https://www.eecis.udel.edu/~mills/ntp/html/rate.html

Here is my current NTP configuration: https://github.com/cunnie/deployments/blob/95e9c71e882d453ec...


I have never understood why people think kod is a useful setting. Why do you think a misbehaving/improperly configured client is going to honor the kod packet? The kod packet helps with some clients but I have never seen it change the behavior of the most egregious abusers. Just ignore future requests from misbehaving clients; there is not a lot of benefit in saying "please stop misbehaving" to a client that does not follow the spec.

Your configuration is lacking two big best practices. The most glaring is that you really need to add 'iburst' to your server stanzas. After that you should think about adjusting minsane and minclock.
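
Something like this (a sketch; the pool hostnames are just placeholders):

```
# iburst: send a burst of packets when the server is unreachable,
# so initial sync takes seconds instead of many minutes
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst

# require at least 3 usable sources before setting the clock,
# and keep 4 candidates around for the selection algorithm
tos minsane 3 minclock 4
```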


It's not strictly harmful, I think, to presume that some percentage of misbehaving clients might just be misconfigured and honor a KoD, as long as you have other measures, unless I'm overlooking something?


I never said it was harmful.


Can you post a few more lines about this? I'm in the process of standing up a server and would like to know more about controlling the load. Thanks!


If an NTP client is already misbehaving and/or misconfigured to the point where it's considered "abusive", what are the chances the client will do The Right Thing(TM) when it receives a "kiss of death" ("kod") packet from the NTP server?

Enable KOD, by all means, but you may also consider putting in some (high) per-IP rate limiting for 123/UDP in your firewall rules as a backup plan (for if/when clients ignore kod).
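
A sketch with iptables' hashlimit match, if you're on Linux (the numbers are arbitrary; tune to taste):

```
# per-source-IP budget: up to 10 NTP queries/minute, with a burst of 20
iptables -A INPUT -p udp --dport 123 -m hashlimit \
  --hashlimit-name ntp --hashlimit-mode srcip \
  --hashlimit-upto 10/minute --hashlimit-burst 20 -j ACCEPT
# anything over budget gets dropped
iptables -A INPUT -p udp --dport 123 -j DROP
```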


Virtualization and time sync have had notorious problems. One ugly workaround was frequent NTP polling and adjustments. NTP has a min and max poll interval, and it automatically determines how frequently it should poll based on how much drift it sees. If the clock drifts quickly, it will gravitate to the minpoll value, which is exactly what their first graph shows: tons of polling at the minimum interval for certain hypervisors.
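
The poll interval is expressed as a power of two seconds; e.g. in ntp.conf (the server name is a placeholder):

```
# poll between 2^6 = 64 s and 2^10 = 1024 s (ntpd's defaults),
# backing off toward maxpoll as the clock stabilizes
server time.example.com minpoll 6 maxpoll 10
```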


I knew that was once an issue; I'm surprised to find it's still an issue these days.

In addition, even with minpoll, that's still a hell of a lot of clients abruptly polling him abusively.


I wonder, should this be solved with local NTP servers that themselves sync from pool.ntp.org?


I've found it to be good practice to run a single stratum 2 node (or a few) to serve local resources. It's usually more important that these resources stay in sync with each other than with a satellite. That's best served by having as few nodes as possible trying their luck over the internet against bogged-down public stratum 1 sources, and configuring everything else to use a single source of time from a box nearby.
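
Roughly this shape (a sketch; the names and addresses are invented):

```
# the one local stratum 2 box does the internet round-trips...
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# ...and serves the LAN, which never talks to the public pool directly
restrict default limited kod nomodify notrap nopeer noquery
restrict 10.0.0.0 mask 255.0.0.0 nomodify notrap nopeer
restrict 127.0.0.1
```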


Most corporate environments have long defaulted to having a stratum 2 NTP server, at least if they use Windows, since Kerberos requires clocks to be roughly in sync.

With the advent of the cloud I see a lot of businesses forgoing any central IdM solution like AD or IPA, and as a result they don't tend to have an NTP server unless they explicitly configured one.

All of my servers are joined to an AD or IPA domain, so they all use my local NTP servers by default.


I too was expecting some sort of conclusion.


[author]

My bad. Now that the post has been picked up on HN, I'll try to write up a conclusion.


Hats off to everyone contributing to public services like this.

My then-company wanted to give back by doing this many years ago, and it was an eye-opening experience. We had trouble almost immediately with utilization and script kiddies. The company ended up only doing it for a relatively short period and made contributions to projects instead.


[author]

Thanks, Spooky23. As you pointed out, contributing to the community takes more time than originally expected.


Our student-run computing club added a machine to the pool and melted the University's firewall. Oops.


Honestly, that's the University's fault then. Properly configured, it should've had very little noticeable effect on the firewall (e.g. "permit udp any host 10.11.12.13 eq 123") as there's no need to do any inspection or track any state ...

... unless they saturated the available bandwidth but, really, that's a different issue (although also preventable!).


I think this is a great look at walking through the analysis. I too experienced a huge spike in NTP traffic in 2014, but it was because of people exploiting NTP for reflection attacks to DDoS other parties. That forced me to use a GPS module and a BeagleBone Black as an internal time server (which has been great).


I have a few questions about that if you have a minute:

What GPS module did you go with and is it still available? Did you have problems getting signal inside (need to be by a window, run an antenna, etc)?


I used the Adafruit "ultimate" GPS module (https://www.adafruit.com/product/746), which has the 1PPS output and can connect to an external antenna. Then I got this antenna (https://www.adafruit.com/product/960) and this adapter (https://www.adafruit.com/product/851). I soldered a header connector to a BeagleBone proto cape (https://www.adafruit.com/product/572) and wired it to the serial port, with PPS to a GPIO pin (just like this: https://web.archive.org/web/20131209092059/http://the8thlaye...).

I put the GPS antenna on my window sill, it has no problem at all staying locked. My plan had been to stick it outside the window but turned out not to be necessary.
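
For anyone replicating this, the ntpd side is the NMEA refclock; from memory it looked roughly like the below (unit numbers, mode bits, and the /dev/gps0 and /dev/gpspps0 symlinks depend on your wiring, so treat it as a sketch):

```
# NMEA refclock (driver 20), unit 0; mode 16 selects 9600 baud
server 127.127.20.0 mode 16 prefer
# flag1 1 enables PPS processing on the matching /dev/gpsppsN device
fudge 127.127.20.0 flag1 1
```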


I've also used the Adafruit module, although with a Raspberry Pi. FWIW, the PPS signal seemed to be a little off from another receiver though (Garmin GPS 18x LVC). I never tried to pinpoint the issue but I strongly suspected the Pi. No noticeable issues w/ the antenna inside on a window sill (should be fine as long as you can see four satellites).

I haven't tried it myself but I've heard of several other good experiences w/ the BeagleBone Black. The Garmin seemed to work the best for me, although it is a little more expensive. I was strongly considering putting a few of them in $work's (private) facilities as a fun, nerdy project but I never got around to actually doing it. The Garmin with a BBB might very well be a great combination for that.

One other thing: make sure you use a "real" serial (or parallel) port -- not a USB to serial adapter!


Because you use AWS and they charge insane fees for outgoing bandwidth.


This $500 would have been $60 on a Digital Ocean box. DO has a 1 TB/month limit; their usage was 300 GB/month.


[author]

Digital Ocean is a great deal! Thanks for pointing that out.

The reason I use {aws,azure,google} to host my NTP servers is that my day job is developing a VM orchestrator (BOSH) for Cloud Foundry, and BOSH doesn't support Digital Ocean yet (AFAIK). But that's a personal choice, and an admittedly expensive one.


Why not just use the public pool? A virtualized NTP server isn't the ideal scenario and there are plenty of "not quite 100% public" NTP servers you could use as well (i.e. you first have to send an e-mail to get access).


Funnily enough, last time I had NTP troubles, it was on a BOSH deployment. BOSH wants to use ntpdate to keep time (because real NTP is too normal?) and we'd overzealously configured the security groups around our NAT setup so the machines couldn't reach their NTP server. Whoops.

It'd be nice if BOSH could have detected this problem and warned us about it, ideally during deployment. But it'd be even nicer if we didn't suck at configuring AWS. If you could fix either of those things, that would be great!


What about amazon lightsail?


It also runs on AWS, so even though they try to trick you into thinking you don't pay for bandwidth, you do. Also, small EC2 instances are throttled, so their server would probably not function well most of the time.


They include 1 TB of bandwidth in the $5 instance; that's $90 in savings.


Literally any of the major providers, like Linode or Digital Ocean, gives you much more reasonably priced bandwidth. You don't get all the nice AWS tools, but if all you use is EC2 or S3, you can still use Vagrant or Terraform plus some type of configuration management (Ansible, Puppet, etc.) to provision servers programmatically.


I'm surprised there are no comments about the fact that these guys decided to run NTPD on a VM.


[author]

NTP runs fairly decently in a VM. Don't take my word for it — look at the graphs of my servers:

Here's my Google VM, notice the jitter is within +/- 5 milliseconds:

http://www.pool.ntp.org/scores/104.155.144.4

Here's my Hetzner VM (Germany). +/- 10 milliseconds, though I can't help but suspect the distance from the monitoring station (Los Angeles) may have more to do with it than being a VM:

http://www.pool.ntp.org/scores/78.46.204.247

Here's my AWS VM. Much worse than Google in that it's +/- 50 milliseconds, but still good enough to pass muster with pool.ntp.org:

http://www.pool.ntp.org/scores/52.0.56.137

Here's my Azure VM. It's in Singapore, and I re-deployed it last night, so the numbers are still coming in, but it has a pretty tight distribution:

http://www.pool.ntp.org/scores/52.187.42.158


Everyone's needs differ, I suppose, so some might consider that "decent". 10ms -- or even 50ms -- might be acceptable for many (most?) use cases but not for me.

From a quick look, my own (stratum 2) server in the pool currently has an offset of just under 1/20th of one millisecond.

Regardless, thanks for contributing to the pool!


For another data point, here's my Digital Ocean droplet in Bangalore which sees from 10-60k queries per second depending on the time of day: http://www.pool.ntp.org/user/bradfa


Interesting stats. Thanks!


I had the same thought. Does the world really need another (presumably) stratum-3 server running in Amazon's cluster, when Amazon already runs a pool of stratum-2 servers? ([0-3].amazon.pool.ntp.org).


By my recent reading of the AWS docs, those pool addresses are not run by Amazon. They are DNS names that allow NTP load from AWS to the NTP pool to be distributed more fairly.


It's a vendor zone, anyone can register for a vendor zone...


Ok, interesting.


You're not supposed to run an NTP server on a VM. CPU cycles can be taken from the guest and used by the host or another guest. Another option is to drill a hole in the data center and run a GPS antenna to the roof to get the time from the GPS satellites, but I think that is something many VPS/cloud providers will not allow.

Edit: s/your/you're/


t1.micro to boot - with more than the usual nondeterministic statmuxing and arbitrary "fairness" policy.


[author]

Yeah, maybe that's why my AWS instance is the most jittery of my 4 timeservers (the others being Google, Hetzner, and Azure).


Your website says:

Pivotal bridges the Silicon Valley state of mind, modern approach and infrastructure with your organization’s core expertise and values. Who we are and what we do together can reshape the world

Your article suggests otherwise.


Why are people joining VMs to the NTP pool? These servers should be identified by address space and blacklisted.


Why?


Because VMs themselves might not be able to keep track of time accurately (potentially inconsistent tick rate) the way that a bare-metal setup would. That's why they should be mere consumers (i.e. sync their time to whatever the remote says) rather than contribute to the pool.


Correct, unless you have a very specific VM configuration where you are truly dedicating a CPU/core to a VM, it's not fit for being an NTP server.


I ran an NTP server on a Raspberry Pi for some time.

The bottleneck I kept hitting was the 65535 NAT translation limit on my Cisco router, and even at that point load was quite manageable on the Pi.

It's extraordinary how much traffic one cheap device could service.


This is a 2 year old article. Where's the follow up? What did they end up doing?

EDIT: AHA! Part 2: https://blog.pivotal.io/labs/labs/ntp-server-costing-500year...


Why would anyone run an authoritative time service on a virtual server in the first place? My experience is that system time suffers greatly from noisy neighbor.


... because it's hosted in the cloud and you don't get any free bandwidth with your VPS?


I'm not sure if I've missed it, but is the question (from the title) ever answered? The discrepancy between expected traffic volume and actual traffic volume is huge and seemingly unexplained.


[author]

My bad — I never wrapped it up. Thanks to the HN interest, I'll try to write Part 3 over the winter break.

The short version is this: it's gonna cost a couple of hundred dollars to run a 1 GbE NTP server in pool.ntp.org, but you can tweak ntp.conf to save ~$100.


This is the Snapchat bug reported yesterday, right?

Incidentally, how is AWS dealing with the leap second next week? Google is going to have their time servers start to run fast around 20 minutes in advance of the leap second, so they're back in sync at 00:00:60 UTC.



Nope, this is from 2014.

If memory serves, Google's "smoothing" the second out over (I think) a 24-hour period. I don't recall the exact time period off the top of my head but it's much, much longer than 20 minutes.



It was mainly due to the poorly coded Snapchat app: https://news.ntppool.org/2016/12/load/

EDIT: this post was indeed from 2014, my bad then. However, the same issue started again two weeks ago (~17 Dec 2016).


Here is a visual representation of the effect of the Snapchat brokenness on my NTP server:

https://cloud.githubusercontent.com/assets/1020675/21468123/...

Note that inbound traffic, which was steady at ~4k packets/sec, spiked as high as five times that. Also note that the Snapchat traffic followed a circadian rhythm (much higher traffic during the daytime).


Seems like the post was written in 2014 while the Snapchat NTP issue was more recent.


It was mainly due to VirtualBox querying every 64 seconds.


This post was from 2014, so I doubt it was this month's Snapchat issue.


Could this have been the result of an NTP amplification attack? https://www.us-cert.gov/ncas/alerts/TA13-088A


The article said no, because the traffic was symmetrical and not lopsided. If this had been part of an attack you'd expect to see far more outgoing bandwidth than incoming.



