Why is my NTP server costing $500 per year? Part 1 (2014) (pivotal.io)
159 points by t0mas88 on Dec 24, 2016 | 109 comments



Reminds me of when Netgear decided to use the University of Wisconsin NTP servers as the default in their consumer products: http://pages.cs.wisc.edu/~plonka/netgear-sntp/


The most frustrating part of things like this, and that Snapchat issue, is that the largest abusers could probably swallow the cost of their own NTP server usage as a rounding error on their bottom line.


The university commitment to still serve the public is admirable.


How is that even legal?


Once you put an NTP server on the 'net, it's public - pretty much like most Web sites. Sure, there are reasonable expectations of decency like for anything in the Commons, but I don't think there's any legal defense against skunks at the picnic.

IIRC, the university called Netgear out for doing something stupid and disruptive, and Netgear stopped doing it. The second best possible scenario, I guess.


> Netgear stopped doing it.

Netgear issued patches for the devices. Most people never update their router firmware, and we're talking about over 700,000 devices. The university still gets considerable traffic.

https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse#NE...


Can't the university throttle non-local connections? (Netgear could have provided the hardware ;)


Throttling won't do much good; their WAN interfaces will still eat the traffic. I don't think the issue was so much that the NTP servers were melting; it's more that their entire network was.

For what you want to have any effect, they'd have to sinkhole/throttle the traffic upstream before it ever reaches them, and as a university they're effectively an ISP, so that might not even be possible.


Not necessary -- fortunately these routers all used source port 23457 for their NTP packets, making them trivially easy to block.
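
These would also be easy to drop with a stateless rule; e.g. on a Linux box (a sketch, just for illustration; the university's border gear would have its own syntax):

```
# Netgear's broken SNTP client always sent from UDP source port 23457
iptables -A INPUT -p udp --sport 23457 --dport 123 -j DROP
```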

I do recommend reading the incident report posted above if you have an interest in network operations, it's quite interesting!


You still have to receive the traffic to block it.


It is a public NTP server. Throttling the public defeats the point.


Not really. If an NTP client gets 1/10th of the updates that it wants, it will still keep reasonably good time.


Keeping a table of clients and their last-updated time is probably more expensive than just sending them a response.


Randomly dropping a percentage of requests is stateless.
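
E.g., on Linux (a sketch using the iptables statistic match; real border routers would use their own equivalent):

```
# randomly drop ~50% of inbound NTP requests; no per-client state is kept
iptables -A INPUT -p udp --dport 123 -m statistic --mode random --probability 0.5 -j DROP
```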


But what happens if you are the client whose requests are dropped all the time because you are unlucky?


Then I guess you'll have to use somebody else's free server.


Is the NTP synchronization done with only a single packet? Because otherwise you'll be interrupting connections constantly.


It is UDP, so it is connectionless.


UDP is a Transport Layer protocol. NTP is an Application Layer protocol.


Fair enough, UDP doesn't imply that the application protocol is connection-less, but AFAIK NTP is.


Do NTP clients not retry immediately/very quickly if a request gets lost?


It's more that your device's clocks don't skew that rapidly, unless something is really wrong.


But dropping requests to reduce traffic load is counter-productive if failed requests are quickly retried.


Retry yes, immediately/very quickly no. NTP is designed to handle network issues transparently.


This was one of the issues with Netgear's client: it retried every second until it worked.


Incorrect clients aside, it should at worst be another 64 seconds (from memory so I might be wrong) before a client retries a poll.


> The university still gets considerable traffic

I'd be interested in seeing if there's been any update since 2003.

E.g., is it really "considerable traffic" by 2016 standards? The original flood in 2003 was 150 Mbps - I don't think I'd notice if I got a flood of 150 Mbps on my home connection.

How many of those devices are still around 13 years later?


I believe an "agreement was forged" - no public details, but I'd assume some money changed hands.


$375,000.

https://en.wikipedia.org/wiki/NTP_server_misuse_and_abuse#NE...

> NETGEAR has donated $375,000 to the University of Wisconsin–Madison's Division of Information Technology for their help in identifying the flaw.

Although that appears to be uncited.


Wow, I never realized operators couldn't push fixes to their routers without permission. The internet is indeed a tragedy of the commons: trivial to ruin, but a Sisyphean task to fix.


Most admins would consider having their network infrastructure's firmware change outside of their control to be a bug/misfeature. Not to mention most devices would require a reboot to apply the change.

And being able to remotely change the running code is a HUGE security issue.


The vast majority of admins don't even know that they're admins. They bought or received a cheap Netgear router, plugged it in, and never touched it again, except to maybe turn it off and on again when the internet was slow/down.

If you're an admin who cares about their infrastructure, you're not using a bargain-basement Netgear router, and if you are, you'll have gone through every single menu and seen the auto-update option.


Sure. It's also why the internet is super vulnerable to 0-days.


Some operators do, mostly ISPs that lease routers to customers and retain a way to push firmware updates to them (for example, Comcast does this). But router manufacturers typically don't touch the device once it's out of their hands.


Note that cable modems (all of them, not just from Comcast) download their configuration from the provider every time they boot up. Ironically (since it uses TFTP, for one), this is called "secure provisioning".

They might give you a web interface where you can configure certain settings (e.g. integrated Wi-Fi) but the ISP ultimately has at least some control over any cable modem connected to it.


I wonder if they somehow mistakenly joined their server to the region-specific pr.pool.ntp.org group[1]. At the moment, that pool exists, but has no servers in it.

So, if you were the only server in the pool, perhaps you would get a lot of Puerto Rican traffic?

[1]http://www.pool.ntp.org/zone/pr


Someone else commented on the article, but it's probably related to Puerto Rico ISPs using NAT because of a lack of IPv4 address space. That single IP is probably many, many people.


That one doesn't make sense to me. Most people don't have NTP configured to point to pool.ntp.org... they are mostly PCs pointed at time.windows.com. And the pool is big enough anyway that it would spread the load from a relatively small island pretty well. NAT could be a small part of it, but there's a different primary cause.


It doesn't necessarily need to be Windows making the calls. Cell phones typically sit behind NAT, and there was recently an issue with Snapchat DDoSing NTP servers.


That's an example of a "different primary cause". It's not NAT in that case, it's an app using a library with terrible defaults.


An interesting theory from one of the comments:

> I wonder if Puerto Rico has run out of its pool of IPv4 addresses. After Europe and Asia, just this month Latin America as well, have exhausted their IPv4 pools, many local ISPs have resorted to using NAT to deal with the scarcity of addresses (of course, after years procrastinating IPv6 and pretending that this day wouldn't come about). Given that the source is a Puerto Rican ISP, and one of the offending addresses from a small /21 network, it's possible that NAT is to blame. As ISP NAT increasingly becomes more prevalent, this is going to be rather touchy to deal with abuses. For is it an abuser or just several innocent users behind a NAT?


I'm not understanding why NAT would cause it. I could see something like a misconfigured forwarding DNS cache causing it, where it only queries pool.ntp.org once and keeps returning the result in the same order (with Pivotal's IP at the top of the list) to a large number of querying clients. Then, perhaps, if there are a bunch of NATted clients behind one IP? NAT, on its own, without some other contributing factor, shouldn't cause this.


NAT wouldn't cause it, but it would hide the fact that those are many clients all sharing the same source IP. Of course, that wouldn't explain why they observed a general increase in traffic.



I wish he'd explained somewhere how they leapt to examining virtualized NTP clients, or what they ultimately did (since there's no part 3 that I can find).


[author]

> I wish he'd explained somewhere how they leapt to examining virtualized NTP clients...

I had a hunch [wrongly] that the traffic was caused by a particular operating system. I didn't have enough machines to run the tests on bare-metal, so I virtualized them. And I suspected that virtualization would provide a worst-case scenario (the virtualized clocks would be jittery).

My big surprise was that Windows was a model client (once per day), OS X was good, too, and that FreeBSD was the worst and Ubuntu a close second. It was a complete inversion of what I had expected to find.

> what they ultimately did (since there's no part 3 that I can find).

I never wrote part 3, but now that it seems the post has gained traction on Hacker News I might be inspired to write one. The summary would be as follows:

To reduce your costs by a quarter, use the following lines in your NTP configuration file to throttle overly-aggressive clients (the important directive is `limited`):

```
restrict default limited kod nomodify notrap nopeer
discard minimum 0
```

Here is a description of NTP rate-limiting and why `limited`, `kod`, and `discard` are important: https://www.eecis.udel.edu/~mills/ntp/html/rate.html

Here is my current NTP configuration: https://github.com/cunnie/deployments/blob/95e9c71e882d453ec...


I have never understood why people think kod is a useful setting. Why do you think a misbehaving/improperly configured client is going to honor the kod packet? The kod packet helps with some clients but I have never seen it change the behavior of the most egregious abusers. Just ignore future requests from misbehaving clients; there is not a lot of benefit in saying "please stop misbehaving" to a client that does not follow the spec.

Your configuration is lacking two big best practices. The most glaring is that you really need to add 'iburst' to your server stanzas. After that you should think about adjusting minsane and minclock.
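
Something like this (a sketch; the pool hostnames are just placeholders):

```
# iburst: send a burst of packets when the server is unreachable,
# so initial sync takes seconds instead of many minutes
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
server 3.pool.ntp.org iburst

# require at least 3 usable sources before setting the clock,
# and keep 4 candidates around for the selection algorithm
tos minsane 3 minclock 4
```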


It's not strictly harmful, I think, to presume that some percentage of misbehaving clients might just be misconfigured and honor a KoD, as long as you have other measures, unless I'm overlooking something?


I never said it was harmful.


Can you post a few more lines about this? I'm in the process of standing up a server and would like to know more about controlling the load. Thanks!


If an NTP client is already misbehaving and/or misconfigured to the point where it's considered "abusive", what are the chances the client will do The Right Thing(TM) when it receives a "kiss of death" ("kod") packet from the NTP server?

Enable KOD, by all means, but you may also consider putting in some (high) per-IP rate limiting for 123/UDP in your firewall rules as a backup plan (for if/when clients ignore kod).
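
A sketch with iptables' hashlimit match, if you're on Linux (the numbers are arbitrary; tune to taste):

```
# per-source-IP budget: up to 10 NTP queries/minute, with a burst of 20
iptables -A INPUT -p udp --dport 123 -m hashlimit \
  --hashlimit-name ntp --hashlimit-mode srcip \
  --hashlimit-upto 10/minute --hashlimit-burst 20 -j ACCEPT
# anything over budget gets dropped
iptables -A INPUT -p udp --dport 123 -j DROP
```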


Virtualization and time sync have had notorious problems. One ugly workaround was frequent NTP polling and adjustments. NTP has a min and max poll interval, and it automatically determines how frequently it should poll based on how much drift it sees. If the clock drifts quickly, it will gravitate to the minpoll value, which is exactly what their first graph shows: tons of polling at the minimum interval for certain hypervisors.
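
The poll interval is expressed as a power of two seconds; e.g. in ntp.conf (the server name is a placeholder):

```
# poll between 2^6 = 64 s and 2^10 = 1024 s (ntpd's defaults),
# backing off toward maxpoll as the clock stabilizes
server time.example.com minpoll 6 maxpoll 10
```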


I knew that was once an issue; I'm surprised to find it's still an issue these days.

In addition, even with minpoll, that's still a hell of a lot of clients abruptly polling him abusively.


I wonder, should this be solved with local NTP servers that themselves sync from pool.ntp.org?


I've found it to be good practice to run a single stratum 2 node (or a few) to serve local resources. It's usually more important that these resources stay in sync with each other than with a satellite. That's best served by having as few nodes as possible trying their luck over the internet against bogged-down public stratum 1 sources, and configuring everything else to use a single source of time from a box nearby.
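
Roughly this shape (a sketch; the names and addresses are invented):

```
# the one local stratum 2 box does the internet round-trips...
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst

# ...and serves the LAN, which never talks to the public pool directly
restrict default limited kod nomodify notrap nopeer noquery
restrict 10.0.0.0 mask 255.0.0.0 nomodify notrap nopeer
restrict 127.0.0.1
```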


Most corporate environments have long defaulted to having a stratum 2 NTP server, at least if they use Windows, since Kerberos requires clocks to be roughly in sync.

With the advent of the cloud I see a lot of businesses forgoing any central IdM solution like AD or IPA, and as a result they don't tend to have an NTP server unless they explicitly configured one.

All of my servers are joined to an AD or IPA domain, so they all use my local NTP servers by default.


I too was expecting some sort of conclusion.


[author]

My bad. Now that the post has been picked up on HN, I'll try to write up a conclusion.


Hats off to everyone contributing to public services like this.

My then-company wanted to give back by doing this many years ago, and it was an eye-opening experience. We had trouble almost immediately with utilization and script kiddies. The company ended up only doing it for a relatively short period and made contributions to projects instead.


[author]

Thanks, Spooky23. As you pointed out, contributing to the community takes more time than originally expected.


Our student-run computing club added a machine to the pool and melted the University's firewall. Oops.


Honestly, that's the University's fault then. Properly configured, it should've had very little noticeable effect on the firewall (e.g. "permit udp any host 10.11.12.13 eq 123") as there's no need to do any inspection or track any state ...

... unless they saturated the available bandwidth but, really, that's a different issue (although also preventable!).


I think this is a great look at walking through the analysis. I too experienced a huge spike in NTP traffic in 2014, but it was because of people exploiting NTP for reflection attacks to DDoS other parties. That forced me to use a GPS module and a BeagleBone Black as an internal time server (which has been great).


I have a few questions about that if you have a minute:

What GPS module did you go with and is it still available? Did you have problems getting signal inside (need to be by a window, run an antenna, etc)?


I used the Adafruit "ultimate" GPS module (https://www.adafruit.com/product/746), which has the 1PPS output and can connect to an external antenna. Then I got this antenna (https://www.adafruit.com/product/960) and this adapter (https://www.adafruit.com/product/851). I soldered a header connector to a BeagleBone proto cape (https://www.adafruit.com/product/572) and wired it to the serial port, with PPS to a GPIO pin (just like this: https://web.archive.org/web/20131209092059/http://the8thlaye...).

I put the GPS antenna on my window sill, it has no problem at all staying locked. My plan had been to stick it outside the window but turned out not to be necessary.
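
For anyone replicating this, the ntpd side is the NMEA refclock; from memory it looked roughly like the below (unit numbers, mode bits, and the /dev/gps0 and /dev/gpspps0 symlinks depend on your wiring, so treat it as a sketch):

```
# NMEA refclock (driver 20), unit 0; mode 16 selects 9600 baud
server 127.127.20.0 mode 16 prefer
# flag1 1 enables PPS processing on the matching /dev/gpsppsN device
fudge 127.127.20.0 flag1 1
```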


I've also used the Adafruit module, although with a Raspberry Pi. FWIW, the PPS signal seemed to be a little off from another receiver though (Garmin GPS 18x LVC). I never tried to pinpoint the issue but I strongly suspected the Pi. No noticeable issues w/ the antenna inside on a window sill (should be fine as long as you can see four satellites).

I haven't tried it myself but I've heard of several other good experiences w/ the BeagleBone Black. The Garmin seemed to work the best for me, although it is a little more expensive. I was strongly considering putting a few of them in $work's (private) facilities as a fun, nerdy project but I never got around to actually doing it. The Garmin with a BBB might very well be a great combination for that.

One other thing: make sure you use a "real" serial (or parallel) port -- not a USB to serial adapter!


Because you use AWS and they charge insane fees for outgoing bandwidth.


This $500 would have been $60 on a Digital Ocean box. DO has a 1 TB/month limit; their usage was 300 GB/month.


[author]

Digital Ocean is a great deal! Thanks for pointing that out.

The reason I use {aws,azure,google} to host my NTP servers is that my day job is developing a VM orchestrator (BOSH) for Cloud Foundry, and BOSH doesn't support Digital Ocean yet (AFAIK). But that's a personal choice, and an admittedly expensive one.


Why not just use the public pool? A virtualized NTP server isn't the ideal scenario and there are plenty of "not quite 100% public" NTP servers you could use as well (i.e. you first have to send an e-mail to get access).


Funnily enough, last time I had NTP troubles, it was on a BOSH deployment. BOSH wants to use ntpdate to keep time (because real NTP is too normal?) and we'd overzealously configured the security groups around our NAT setup so the machines couldn't reach their NTP server. Whoops.

It'd be nice if BOSH could have detected this problem and warned us about it, ideally during deployment. But it'd be even nicer if we didn't suck at configuring AWS. If you could fix either of those things, that would be great!


What about amazon lightsail?


It also runs on AWS, so even though they try to trick you into thinking you don't pay for bandwidth, you do. Also, small EC2 instances are throttled, so their server would probably not function well most of the time.


They include 1 TB of bandwidth in the $5 instance; that's $90 in savings.


Literally any of the major providers, like Linode or Digital Ocean, gives you much more reasonably priced bandwidth. You don't get all the nice AWS tools, but if all you use is EC2 or S3, you can still use Vagrant or Terraform plus some type of configuration management (Ansible, Puppet, etc.) to provision servers programmatically.


I'm surprised there are no comments about the fact that these guys decided to run NTPD on a VM.


[author]

NTP runs fairly decently in a VM. Don't take my word for it — look at the graphs of my servers:

Here's my Google VM, notice the jitter is within +/- 5 milliseconds:

http://www.pool.ntp.org/scores/104.155.144.4

Here's my Hetzner VM (Germany). +/- 10 milliseconds, though I can't help but suspect the distance from the monitoring station (Los Angeles) may have more to do with it than being a VM:

http://www.pool.ntp.org/scores/78.46.204.247

Here's my AWS VM. Much worse than Google in that it's +/- 50 milliseconds, but still good enough to pass muster with pool.ntp.org:

http://www.pool.ntp.org/scores/52.0.56.137

Here's my Azure VM. It's in Singapore, and I re-deployed it last night, so the numbers are still coming in, but it has a pretty tight distribution:

http://www.pool.ntp.org/scores/52.187.42.158


Everyone's needs differ, I suppose, so some might consider that "decent". 10ms -- or even 50ms -- might be acceptable for many (most?) use cases but not for me.

From a quick look, my own (stratum 2) server in the pool currently has an offset of just under 1/20th of one millisecond.

Regardless, thanks for contributing to the pool!


For another data point, here's my Digital Ocean droplet in Bangalore which sees from 10-60k queries per second depending on the time of day: http://www.pool.ntp.org/user/bradfa


Interesting stats. Thanks!


I had the same thought. Does the world really need another (presumably) stratum-3 server running in Amazon's cluster, when Amazon already runs a pool of stratum-2 servers? ([0-3].amazon.pool.ntp.org).


By my recent reading of the AWS docs, those pool addresses are not run by Amazon. They are DNS names that allow NTP load from AWS to the NTP pool to be distributed more fairly.


It's a vendor zone, anyone can register for a vendor zone...


Ok, interesting.


You're not supposed to run an NTP server on a VM. CPU cycles can be taken from the guest and used by the host or another guest. Another option is to drill a hole in the data center and run a GPS antenna to the roof to get the time from the GPS satellites, but I think that is something many VPS/cloud providers will not allow.

Edit: s/your/you're/


t1.micro to boot - with more than the usual nondeterministic statmuxing and arbitrary "fairness" policy.


[author]

Yeah, maybe that's why my AWS instance is the most jittery of my 4 timeservers (the others being Google, Hetzner, and Azure).


Your website says:

Pivotal bridges the Silicon Valley state of mind, modern approach and infrastructure with your organization’s core expertise and values. Who we are and what we do together can reshape the world

Your article suggests otherwise.


Why are people joining VMs to the NTP pool? These servers should be identified by address space and blacklisted.


Why?


Because VMs themselves might not be able to keep track of time accurately (potentially inconsistent tick rate) the way that a bare-metal setup would. That's why they should be mere consumers (i.e. sync their time to whatever the remote says) rather than contribute to the pool.


Correct, unless you have a very specific VM configuration where you are truly dedicating a CPU/core to a VM, it's not fit for being an NTP server.


I ran an NTP server on a Raspberry Pi for some time.

The bottleneck I kept hitting was the 65535 NAT translation limit on my Cisco router, and even at that point load was quite manageable on the Pi.

It's extraordinary how much traffic one cheap device could service.


This is a 2 year old article. Where's the follow up? What did they end up doing?

EDIT: AHA! Part 2: https://blog.pivotal.io/labs/labs/ntp-server-costing-500year...


Why would anyone run an authoritative time service on a virtual server in the first place? My experience is that system time suffers greatly from noisy neighbor.


... because it's hosted in the cloud and you don't get any free bandwidth with your VPS?


I'm not sure if I've missed it, but is the question (from the title) ever answered? The discrepancy between expected traffic volume and actual traffic volume is huge and seemingly unexplained.


[author]

My bad — I never wrapped it up. Thanks to the HN interest, I'll try to write Part 3 over the winter break.

The short version is this: it's gonna cost a couple of hundred dollars to run a 1 GbE NTP server in pool.ntp.org, but you can tweak ntp.conf to save ~$100.


This is the Snapchat bug reported yesterday, right?

Incidentally, how is AWS dealing with the leap second next week? Google is going to have their time servers start to run fast around 20 minutes in advance of the leap second, so they're back in sync at 00:00:60 UTC.



Nope, this is from 2014.

If memory serves, Google's "smoothing" the second out over (I think) a 24-hour period. I don't recall the exact time period off the top of my head but it's much, much longer than 20 minutes.



It was mainly due to the poorly coded Snapchat app: https://news.ntppool.org/2016/12/load/

EDIT: this post was indeed from 2014, my bad then. However, the same issue started again two weeks ago (~17 Dec 2016).


Here is a visual representation of the effect of the Snapchat brokenness on my NTP server:

https://cloud.githubusercontent.com/assets/1020675/21468123/...

Note that inbound traffic, which was steady at ~4k packets/sec, spiked as high as five times that. Also note that the Snapchat traffic followed a circadian rhythm (much higher traffic during the daytime).


Seems like the post was written in 2014 while the Snapchat NTP issue was more recent.


It was mainly due to VirtualBox querying every 64 seconds.


This post was from 2014, so I doubt it was this month's Snapchat issue.


Could this have been the result of an NTP amplification attack? https://www.us-cert.gov/ncas/alerts/TA13-088A


The article said no, because the traffic was symmetrical and not lopsided. If this had been part of an attack you'd expect to see far more outgoing bandwidth than incoming.



