Ntpd won't save you from one particular rogue bit (rachelbythebay.com)
233 points by dantiberian on Sept 28, 2017 | 79 comments


>Time is hard.

I learned that once when I tried to implement a "universal" datetime library for fun and education. There is wall-clock time, which leaps by political decision; Coordinated Universal Time, which leaps on schedule to ease astronomical differences; atomic time of a few sorts; numerous computer clocks; time zones, which can go back and forth for north and south, land and sea; non-Gregorian calendars, and the fact that there is no year 0 (1 BC is followed directly by 1 AD); dates that were offset by many days a few times in Gregorian history; special and general relativity errors; and of course integer overflow issues.

I may have missed a few points, but still, time is hard.


> Coordinated Universal Time, which leaps on schedule to ease astronomical differences

"Schedule" is a bit generous there.

Oh, and there's TAI for just counting seconds, but you're discouraged from using it that way because of some slightly-suspicious reasoning about retrospective calibration.


That's interesting. Could you elaborate, or give a hyperlink on why TAI is discouraged?


I think it boils down to compatibility - POSIX and SUS specify that CLOCK_REALTIME is UTC, and platform libc time conversion routines expect this. So everyone uses UTC.

On Linux, there's now a CLOCK_TAI, but I'm not sure how usable it is. Does it have the correct offset on boot? When ntpd starts? You also have to convert back and forth to interact with the rest of the world, and there's no call to get the time and the conversion factor atomically, which is what you'd probably want if you're using the same timestamp internally and externally. (I think there's some non-privileged call to get the conversion; you could call that, get the time, then call it again and retry if the offset changed. But that's annoying.) And of course this is non-portable.
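
For illustration, a minimal Linux-only sketch of that retry dance, assuming glibc's adjtimex() (whose struct timex carries the kernel's TAI-UTC offset in its `tai` field, which is 0 if the kernel was never told the offset) and CLOCK_TAI; error handling elided:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <time.h>
    #include <sys/timex.h>

    int main(void) {
        struct timespec ts;
        struct timex tx;
        int tai_off;

        do {
            tx = (struct timex){ .modes = 0 };  /* modes = 0: read-only, unprivileged */
            adjtimex(&tx);
            tai_off = tx.tai;                   /* kernel's TAI-UTC offset, seconds */
            clock_gettime(CLOCK_TAI, &ts);
            tx = (struct timex){ .modes = 0 };
            adjtimex(&tx);                      /* re-read; loop if it moved under us */
        } while (tx.tai != tai_off);

        printf("TAI %lld.%09ld (TAI-UTC = %d s)\n",
               (long long)ts.tv_sec, ts.tv_nsec, tai_off);
        return 0;
    }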

edit: additionally, if you have a stored, bare TAI value, what can you do with it other than see how many seconds older it is than the current time? Nothing on a standard system stores when previous leap seconds happened, IIRC, so you can't convert to UTC (and thus can't convert to civil time) unless you roll your own table (and conversion routines) or always store the offset with every TAI value (increasing storage overhead).


https://pairlist6.pair.net/pipermail/leapsecs/2014-October/0...

"TAI and UTC have a fixed offset relationship, it is true. However, UTC is computed in real time (with several varieties to choose from if you care about the nano-seconds), but TAI is a retrospective timescale that's not computed until after the fact. I get the feeling that the BIPM want TAI to be their baby, free from "production" concerns that UTC has to deal with"


See this[0] article, which gives a good tour of numerous issues with time.

[0]http://FalsehoodsAboutTime.com


> This means you can't actually persist the bad time to your RTC.

I wonder where that limit comes from, since the RTC in a standard PC stores both the year (00-99) and century (also 00-99) in BCD, so a date in 2153 should be representable.
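
For concreteness, a tiny sketch of that encoding (hypothetical register contents, using the usual MC146818 convention where BCD byte 0x53 means decimal 53):

    #include <stdio.h>

    static unsigned bcd_to_bin(unsigned char b) {
        return (b >> 4) * 10 + (b & 0x0f);   /* 0x53 -> 53 */
    }

    int main(void) {
        unsigned char century = 0x21, year = 0x53;   /* hypothetical CMOS bytes */
        printf("%u\n", bcd_to_bin(century) * 100 + bcd_to_bin(year));  /* 2153 */
        return 0;
    }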

This reminds me of one interesting thing I've noticed over the years: early PCs' RTCs were pretty accurate, but the most recent ones, as in the past few years, are horrible (as much as +/- several seconds per day), and the ones in smartphones are even worse. Maybe it's because they're assuming NTP, so they've cut costs by using less accurate crystals? I've had systems in storage for many years, and the RTC was within a few seconds of the current time when turned on again.


A "standard PC" is not in fact required to have a century register. One can obtain from ACPI information where it is if the hardware has one but this is an optional feature and PCs are not required to have a valid register number in their where-the-century-register-is fields. Indeed, the entire MC146818 compatibility system is optional as far as modern ACPI is concerned. There's an FADT flag that signals that it is not there.


I can't say I've ever had a PC with a particularly accurate RTC, though in recent years I've run NTP on everything I can so not noticed any effects.

Anecdotally clocks in small mobile devices like phones do seem less accurate though, at least when they don't have access to the network to update themselves. I've not looked into it but assumed temperature variances, which a device like a phone will experience to a larger extent, were part of the problem.

Of course, phones syncing with the network relies on the network's clocks agreeing with the rest of the world. For some months my provider seemed to be a minute or so behind, which meant I had to adjust a little when gauging whether I was going to make the train I was rushing towards or had likely missed it already...


I've noticed the same thing, but I'm always surprised by it. Particularly with smartphones.

A GPS is basically just a very, very precise clock and a radio receiver.

Why don't smartphones use the GPS clock?

I'm sure there's a good reason (and some probably do), but it still surprises me. (Presumably it's battery related: the GPS clock can't be used unless GPS is fully enabled, or something similar.)


The precise clock is located on the satellites, not the receivers. The receivers figure out the correct time by listening for the signals from multiple satellites and matching them against each other and their own internal tables. This is the expensive part.


Indeed, but the GPS fix you're getting is four-dimensional: if you have position, you also have precise time. Therefore listening for GPS time is not necessary all the time; just update when a precise-enough fix is achieved.


Probably cheaper in terms of battery life to just use NTP than to periodically get a GPS fix.


It's worth noting that to get time with .1s accuracy you don't actually need a GPS fix, just to pick up the 50-bit-per-second navigation signal for a couple seconds, from one satellite.


How would that work unless you know the distance to the satellite?


Based off of some quick and dirty calculations for a GPS satellite at approximately 20,000km above sea level, the difference in distance between a satellite directly overhead and one straight out over the horizon is around 5500km.

That's actually only about an 18ms difference in delay. Unless there's something I'm failing to account for here, you should be able to get better than 0.1s precision based off of a single satellite's signal.

Subtract a flat ~75ms to account for the minimum delay (~66ms @ 20,000km) and half of the variable delay, and you shouldn't really be more than about 9ms out.
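
A quick check of those numbers, as a sketch using the parent's ~20,000 km altitude figure and a spherical Earth:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        const double R = 6371.0, h = 20000.0, c = 299792.458;  /* km, km, km/s */
        double overhead = h;                                   /* straight up */
        double horizon  = sqrt((R + h) * (R + h) - R * R);     /* tangent path */
        double spread   = horizon - overhead;
        printf("spread %.0f km -> %.1f ms of delay variation\n",
               spread, spread / c * 1000);   /* ~5,590 km, ~18.6 ms */
        return 0;
    }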


The distance to an overhead GPS satellite is somewhere in the range of .07-.09 light-seconds. The planet's width is basically irrelevant when your goal is .1 second accuracy.


If you're on Earth, you can assume that the satellite is closer than 30000 km or so. Light takes 0.1s to cross that distance.


Of course. I was thinking "when another app is requesting a GPS fix anyway, might as well use the temporal part, as you're getting it for free."


I have a heating thermostat that has optional ‘smart’ features that allow you to control it via your phone. I tried it when I first got it, but it seemed a bit pointless, so I turned off the feature and the wifi connection. That was a month ago, and the onboard clock has now drifted by around 15 minutes...


>and the ones in smartphones are even worse

Oh, I bet my tooth that I can sometimes hear my phone playing music slightly faster than usual. I never experienced that on a PC. Maybe it's not connected to the RTC and is a purely biological condition (I never cared enough to test two devices side by side), but if someone has experienced that too and has an explanation, it would be great to know.


The RTC is not a factor in problems with music playback on phones or elsewhere. It's only for "wall clock" time, and generally only consulted when the system boots. The OS timer tick that drives the scheduler is often generated by the CPU, and some operating systems don't use it anymore: Linux has offered a tickless config for a while, which I think many Android configs capitalize on to conserve battery. The scheduler could be a culprit for playback discontinuity (stalls) but not for problems with the tempo.

I'm wondering: what might cause music playback to be faster than normal? I don't think I know enough to tell. Perhaps the clocks on the audio playback device are skewed? Most SoCs use a dedicated DSP, which doesn't seem terribly different from a PC's. Not sure, but we can say for certain that it's immune to changes in wall-clock time (whether in the RTC or its representation in the OS).


I had a USB sound card that, on Windows, would play back at 48kHz instead of 44.1kHz: a slight pitch shift upwards, and playback roughly 8% faster. Most apps just let the audio driver provide backpressure and don't actively try to send data at the rate the sound card is expecting; they just feed bytes when asked.

Even more fun was that it was a dual-boot machine, and the audio worked perfectly in Linux but shifted up in Windows. I honestly thought I was losing my mind for a while. I'd listen to the same song in Spotify on both OSes and just get this "something is wrong here..." feeling in my gut.
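
For what it's worth, the mismatch quantified (44.1 kHz material clocked out at 48 kHz speeds up and pitch-shifts by the same ratio):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double ratio = 48000.0 / 44100.0;              /* ~1.088 */
        printf("speedup %.1f%%, pitch shift %+.2f semitones\n",
               (ratio - 1) * 100, 12 * log2(ratio));   /* ~8.8%, ~+1.47 */
        return 0;
    }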


Yes, I've definitely heard it.

Which does not prove it's a fault in the device - it could be a fault in my fried brain - but anecdotally, I've never noticed it on any other music-playing device since the time of the original Walkman (and in that case the music would usually slow down due to wear or other mechanical faults, but rarely if ever speed up).


On some (tape) devices, you could fast-forward while the reading head was still engaged, i.e. playing. This would provide the speedup effect;)


If the pitch sounds higher, then your phone is actually playing back the song faster. If it sounds faster without the pitch being higher, then it's just a perception-of-reality thing.


> cut costs by using less accurate crystals

Maybe... but how much money could they possibly be saving? Quartz oscillators like the ones used in cheap digital watches are plenty accurate and completely abundant. I can't imagine they cost very much.

Maybe the issue has to do with miniaturization? If you're trying to make smaller and smaller devices, there could be an acceptable tradeoff between size and precision.


Most RTCs do use the same crystals that are in wristwatches, which is actually the source of their error.

The tuning-fork crystal design used in wristwatches has a parabolic temperature coefficient, which means that the clock is only really accurate at room temperature. This isn't a problem for wristwatches because, presumably, your wrist is at approximately room temperature, but it does become a problem for electronics that operate with large temperature swings (like the inside of a phone or computer, for instance).

https://www.maximintegrated.com/en/app-notes/index.mvp/id/58
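
To put rough numbers on that parabola, a sketch assuming the typical tuning-fork curve of about -0.034 ppm/degC^2 around a 25 degC turnover (the exact coefficient and turnover point vary by part):

    #include <stdio.h>

    int main(void) {
        for (int t = -10; t <= 60; t += 10) {
            double ppm = -0.034 * (t - 25.0) * (t - 25.0);  /* frequency error */
            printf("%3d degC: %6.1f ppm (%+.1f s/day)\n",
                   t, ppm, ppm * 86400 / 1e6);
        }
        return 0;   /* at 60 degC: ~-42 ppm, i.e. losing ~3.6 s/day */
    }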


The vast majority of quartz references in consumer devices could be a lot better, but they are not fine-tuned at the factory. They're just "close enough". So they drift a lot.

A simple tuning procedure could make almost all quartz clocks you own a heck of a lot better. But in many cases this is hard to do because the manufacturer never put the tuning components (a variable capacitor or the like) on the PCB.


That’s interesting. I’d never thought about it, but of course there’s a temperature dependence, and of course attaching it to a wrist mitigates it.


> This reminds me of one interesting thing I've noticed over the years: early PCs' RTCs were pretty accurate, but the most recent ones, as in the past few years, are horrible

Insofar as sensitivity to thermal conditions is an issue, early PCs tended to be big, airy, ventilated boxes and had fewer parts that would ramp up to high temperature under load but drop down in other operating regimes. So that may be a factor.


Try disabling the power management features for a bit and see if your drift is the same. On NTP servers, I always have to disable cpuspeed, c-states, intel power mgmt, etc.


This would also explain why phone hardware is worse off than PC hardware: it is much more aggressive about managing power.


Well that was fun! I tried out the timeshift program and my PC duly went a bit mad. ntpq -p showed all being well.

Chromium threw a fit because news.ycombinator.com's SSL cert had expired and offered to reset the clock, which it could not do, given that I'm not in the habit of running apps as root. My Kerberos tickets all expired, so Evolution lost contact, and so did quite a few other things. MariaDB dumped core. systemd's timers went berserk, so things like my LetsEncrypt cert tried to renew themselves.

I'm still hunting through some odd-looking log files, but overall things seemed to be back to normal once I reran timeshift and returned to the current time.


Similarly with my Mac: within moments it became unusable. Even trying to reboot from the command line, I got a message that it was waiting to acquire a lock on /. Hard reset and everything was back to normal.


iOS devices had a bug that hard-locked them if you set the time to 1/1/1970 (in US time zones, that puts the timestamp in negative territory). I can't believe they don't do any tests for their OSes to survive this...


Is there a problem with clamping the current time so it can't be set before the release date of the product?


Interesting edge case! This is why it may be sensible to do some extra checks in things like product-expiration batch jobs: check that the previous run was not too far in the future or past, and refuse to run if it was.

Scary to think what might happen if some database purge process is running after this bit flip!


This is going to give me actual nightmares. Not only database cleanups, but there are quite a few automated backup cycling scripts which would happily throw out terabytes of data this way...


As the parent said, it's usually a very good idea to check the time delta and ensure it "makes sense" before doing anything destructive. It doesn't cost much, it can save a lot, and adding a command-line switch to bypass the check lets you skip the "haven't booted in a couple of weeks and now everything refuses to run" problem.

A much more common case where such a check helps is "last run time is in the future", which you face every time there is a clock issue. Some scripts... don't react too well to that.
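
A sketch of such a guard (the stamp-file path and the limits are made up; adapt to taste):

    #include <stdio.h>
    #include <time.h>
    #include <sys/stat.h>

    int main(void) {
        const char *stamp = "/var/lib/mycleaner/last-run";  /* hypothetical stamp file */
        struct stat st;
        time_t now = time(NULL);

        if (stat(stamp, &st) == 0) {
            double delta = difftime(now, st.st_mtime);
            if (delta < 0) {                   /* last run is in the future */
                fprintf(stderr, "clock went backwards; refusing to run\n");
                return 1;
            }
            if (delta > 30 * 86400) {          /* arbitrary one-month ceiling */
                fprintf(stderr, "last run suspiciously old; use --force\n");
                return 1;
            }
        }
        /* ... do the destructive work, then update the stamp file ... */
        return 0;
    }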


One source of RTC weirdness is the fact that inb/outb to/from the RTC and other legacy parts of the ISA aren't protected by a mutex on Linux. So if you're unlucky enough to collide with some other program doing something innocuous with the RTC, you can accidentally set bits that will never be set by the RTC itself. Many RTCs decide not to tick anymore when that happens.

If all your programs use /dev/rtc and ioctl()s you are probably a little safer, because there will be coarse locks around the RTC itself that serialize the activity. But IIRC the inb/outb stuff can be done from user space (as superuser), and even if you're only reading, you have to write to the address register, which could break a write-in-progress by sending its output to the wrong RTC field.
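
The race is easy to see once you remember that even a read is two port operations; a sketch (x86 Linux, needs root; a real program should use /dev/rtc and its ioctls instead):

    #include <stdio.h>
    #include <sys/io.h>

    #define CMOS_ADDR 0x70
    #define CMOS_DATA 0x71

    static unsigned char cmos_read(unsigned char reg) {
        outb(reg, CMOS_ADDR);    /* select the field: racy without a lock */
        return inb(CMOS_DATA);   /* anyone else's outb in between redirects us */
    }

    int main(void) {
        if (ioperm(CMOS_ADDR, 2, 1)) { perror("ioperm"); return 1; }
        printf("seconds register: 0x%02x\n", cmos_read(0x00));
        return 0;
    }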


I've been chasing exactly this issue for weeks!! The problem I'm seeing is that something changes the binary/BCD flag. It turns out that changing this flag doesn't convert the current representation of time inside the RTC; it just changes the logic used to advance the counter fields and apply the next tick.

So if you change that flag without also setting the time, then many of the fields are now invalid. But the tick logic doesn't care. It just sets the invalid fields back to zero and keeps on ticking.

The year 2017 becomes 1411: century 20 and year 17 stored as binary are 0x14 and 0x11, and read back as BCD those bytes mean 14 and 11.

The RTC is a 1984 part that is still being embedded, more or less unchanged, into today's PCs. It is maddening.


> Many RTCs decide not to tick anymore when that happens.

Of course I laughed when I read that, but then I realized I was interpreting that line to mean "it declares shenanigans and stops reporting time so you realize something broke."

Just wanted to clarify - do you mean the above, or "of course software cannot kill hardware!!1" "won't tick anymore"?


Yeah, IIRC it really stopped ticking (until you write a valid value at which point it would resume).

Note that it doesn't stop reporting the time in this condition; it will just give you back all the garbage fields that you wrote before. So an interesting thing happens when the system tries to transform that into a UTC wall-clock basis: it usually ends up with a really wild interpretation of the date (decades/centuries off, similar to the problem described in TFA).


> Yeah, IIRC it really stopped ticking (until you write a valid value at which point it would resume).

Ah, okay then. Good to know I can't accidentally thousands of dedicated servers :P (ie, via NTP MITM, writing to hw RTC...)

> Note that it doesn't stop reporting the time in this condition; it will just give you back all the garbage fields that you wrote before.

I don't know why I didn't remember this last night: Linux uses the RTC as a poor man's NVRAM that will persist across a reboot. It provides <24 bits of data to work with (yay! ...not). https://wiki.ubuntu.com/DebuggingKernelSuspend, useful info in https://github.com/torvalds/linux/blob/e34bac726d27056081d02...

> So an interesting thing happens when the system tries to transform that into a UTC wall-clock basis: it usually ends up with a really wild interpretation of the date (decades/centuries off, similar to the problem described in TFA).

Right.


> If the resulting time that ntpd sees is more than a few milliseconds off, it'll step the clock, and that will clear out the future time.

It shouldn't do that, otherwise it's going to break in 2036.

(The "future time" it's going to clear off will set the time back to 1900 once we are past February 7, 2036)

Edit: Maybe I didn't explain myself well enough. Because the protocol only has 32 bits of seconds, it cannot tell apart September 27, 2017 and November 4, 2153. This means that ntpd absolutely must trust that it is in the correct 68-year span. But according to the author, ntpd "clears out the future time" if it is more than a few milliseconds off. This violates the spec and is also inconsistent, as it happily keeps up with the future date as long as it doesn't have to step the clock.
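
The arithmetic behind those two dates, for reference:

    #include <stdio.h>

    int main(void) {
        double era = 4294967296.0;   /* 2^32 seconds */
        printf("one NTP era = %.2f years\n", era / 86400 / 365.25);
        /* ~136.10 years: 1900 + 136 ~= 2036, and 2017 + 136 ~= 2153 */
        return 0;
    }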


I often see this type of comment on many subjects, and I find it super weird in a way. I mean, if it took you less than a minute after reading to spot the issue, despite not being familiar with the details before, it's usually safe to assume that whoever made the spec and spent days on it spotted it too.


Where did I disagree with the spec?


I didn't say you did? You pointed out an obvious error. I'm saying, yes, I'm pretty sure we can assume it's been handled already, given how easy it is to spot (and the other comment replying to your message seems to imply it is indeed).


You made it sound like I did? Of course it is handled by the spec; the issue here is ntpd not following the spec (assuming what the author says is true; I haven't checked).

Or actually there is this little tidbit from RFC 5905:

> Eras cannot be produced by NTP directly, nor is there need to do so. When necessary, they can be derived from external means, such as the filesystem or dedicated hardware.

So it could be that ntpd looks at some file or the RTC to determine the era rather than assuming the current system time is in the correct era, and that would be allowed by the spec. But it's quite inconsistent if it only does so when the system time is more than a few milliseconds off (presumably beyond the limit of when it corrects the time by slewing). I'm going to go with it just being a bug in ntpd.


Sorry if you understood me that way. I was reflecting on something I see often on HN; e.g., a common case is when Google announces a change in crawling and people go "but it can be gamed by...".

Your message merely reminded me of that, so I put my comment here as I saw a similarity. Having not read the NTP spec in question, I can't answer you.


It won't, this is accounted for in the spec; cf. "pivot dates" (IIRC).


I know that it shouldn't, but the author said that it did in the blog post.


It won't save you from this particular case but you're safe so long as your clock is off by no more than ~68 years.


The other use case is manufactured hardware that ships with its time set to the Unix epoch and depends on NTP to update it.


I get numerous requests on my (macOS) computer for ntpd to connect to shady subnets when I'm connected to a particular commercial VPN:

https://twitter.com/petecooper/status/911946604759977984

https://pbs.twimg.com/media/DKficrvW4AA1Hxm.jpg:large

Numerous hosts across numerous networks, perhaps two or three an hour.

I've wondered what exactly would be gained by resetting a clock to a different time – this is a useful article.


That sounds like

1. your VPN provider is giving you an actual public IP address (??)

2. people are scanning your computer for NTP vulnerabilities or something (this happens if you have a public IP, regardless of network)

3. NTP is using UDP and so connectionless, and so Little Snitch can't distinguish "ntpd wants to reply to someone who contacted it" from "ntpd wants to connect to someone"

An alternative explanation for 1/2 is that your VPN provider is not isolating you from other VPN users (less surprising than giving you your own public IP) and someone else on the VPN is trying to conduct NTP amplification attacks using you: https://blog.cloudflare.com/understanding-and-mitigating-ntp...

In either case, the solution is basically to make your ntpd not listen for requests from other machines and only handle time from your local computer + initiating requests to time.apple.com or whatever your chosen NTP server is. It shouldn't be trying to reply at all to unexpected packets, even to send a refusal message (again, because UDP is connectionless, it's easy for an attacker on your LAN to send spoofed packets and convince you to send replies to some random computer on the internet, and I guess on this VPN, other customers are your LAN). I'm surprised that macOS's default NTP server isn't configured this way out-of-the-box, though.
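
Something like the classic lockdown, as a sketch for a classic-ntpd ntp.conf (the server name is just an example from above; `restrict source` needs a reasonably recent ntpd):

    restrict default ignore                   # drop all unsolicited packets
    restrict 127.0.0.1                        # local queries still allowed
    restrict ::1
    restrict source nomodify noquery notrap   # carve-out for configured servers
    server time.apple.com iburst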


>An alternative explanation for 1/2 is that your VPN provider is not isolating you from other VPN users (less surprising than giving you your own public IP) and someone else on the VPN is trying to conduct NTP amplification attacks using you:

This, I think, is most likely.


It seems strange that the firewall would block the outbound packet after letting the UDP packet from some completely random host in.


Little Snitch is not so much a firewall in the usual sense as a phone-home prevention device. It's primarily interested in blocking outbound traffic (exfiltration), not inbound traffic.


Which VPN provider are you talking about?

Are you sure that it is not just NTP servers being served via DHCP once the connection is established and your computer trying to use those provided by your VPN?


It's PureVPN.

I'm not 100% sure about any of the ntpd connection requests; they're not predictable in their appearance. Some sessions are very quiet (zero requests); in others I get a bunch of incoming connection requests for smbd and other odd things. I really should start taking notes rather than just denying the connections.


It looks like your VPN provider does not isolate traffic between clients connected to the same server. That is pretty bad, security-wise.


That's what I'm thinking. Suffice to say after a bit more poking around today, I no longer use the service.


Leakage from other clients perhaps?


That's my gut feeling, yes.


What server is your NTP daemon configured to use? pool.ntp.org?

That could just be different pool addresses coming up.


Unless you're running it with additional configuration (e.g. the -g option), I don't think ntpd will save you from any bit flips other than in the 10 least significant bits: you'll be outside the 1000s panic threshold.


It's not recommended to use ntpd with the -g argument in production: an attacker can MITM the NTP protocol, and the 1000s threshold severely limits that attack. The attack can be used, e.g., to defeat TOTP.

I'm not sure if this rogue bit can be used to attack TOTP. Can anyone clarify?


We will continue to suffer problems like this so long as we continue to use languages which offer machine words in place of general integers.


I wonder if any of the other ntp implementations (chrony et al) suffer from this same issue?


Yeah, it's an issue with NTP the protocol, not NTP the program.


To expand on this: NTP uses timestamps with 32-bit seconds (plus fractions of a second), so if you manually step your clock by some multiple of ~136 years, the protocol inputs will be exactly the same, and you'd have no way of knowing you were off from the server.

However, you could imagine an NTP implementation which hardcodes the approximate starting time to get the right era. You'd only have to recompile every lifetime or so to keep it up to date.


> However, you could imagine an NTP implementation which hardcodes the approximate starting time to get the right era. You'd only have to recompile every lifetime or so to keep it up to date.

Yep, see "NTP pivot dates" [0]:

> When ntpd(8) receives a unresolved timestamp from an upstream server that timestamp could be based in any era ... To resolve this ambiguity, NTP also uses an internal pivot date ... An ntpd(8) instance’s pivot date will be the date it was compiled and built.

[0]: https://docs.ntpsec.org/latest/rollover.html#ntp_pivots
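
A sketch of that pivot logic: pick the era that lands the 32-bit wire value within 2^31 seconds of the pivot (expressed here in Unix time; 2208988800 is the 1900-to-1970 offset, and the wire value below is a made-up autumn-2017 timestamp):

    #include <stdio.h>
    #include <stdint.h>

    #define NTP_UNIX_OFFSET 2208988800LL   /* seconds from 1900 to 1970 */

    int64_t ntp_to_unix(uint32_t ntp_sec, int64_t pivot) {
        int64_t t = (int64_t)ntp_sec - NTP_UNIX_OFFSET;      /* era-0 guess */
        while (t < pivot - 0x80000000LL) t += 0x100000000LL; /* walk forward */
        while (t > pivot + 0x80000000LL) t -= 0x100000000LL; /* or backward */
        return t;
    }

    int main(void) {
        uint32_t wire = 3716000000u;  /* same 32 bits, two interpretations */
        printf("%lld\n", (long long)ntp_to_unix(wire, 1500000000LL));  /* ~2017 */
        printf("%lld\n", (long long)ntp_to_unix(wire, 5700000000LL));  /* ~2153 */
        return 0;
    }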


Hm, I wasn't able to replicate it. I'm using a Kali Linux VM. I'll try some other OSes and see if it works there.

Was anyone else successful with the PoC code?


It works flawlessly on a physical system. There was rather a lot of red text in journalctl -r 8)


Is your VM host resetting the time in the guest?


That's possible. I'll try it on a physical system in a bit. See what I can do.


This is a nice bug! And not really a clock issue -- many programs/protocols could have such a bug.



