Intro
One of my ntp servers hadn’t been patched in awhile so it was time. Other systems rely on it for time synchronization. The next morning after the patching I noticed that the ntp service wasn’t even running. I started it and went about my business. Checking back some minutes later, it had died again. What happened, and how to get it fixed? Read on to see how we diagnosed and solved this puzzler.
The details
I like to use ntpq -p to query my ntp server – it’s easy to type! So when I started it up the results looked like this:
> ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 10 l 59 64 17 0.000 0.000 0.001
drjegw.drjn.com 192.5.41.209 2 u 56 64 17 0.335 3605497 26.136
drjegw2.drjn.co 192.5.41.41 2 u 56 64 17 19.241 3605534 39.621
drjeuro.drjn.eu 128.252.19.1 2 u 60 64 17 105.970 3605532 38.946 |
remote refid st t when poll reach delay offset jitter
==============================================================================
*LOCAL(0) .LOCL. 10 l 59 64 17 0.000 0.000 0.001
drjegw.drjn.com 192.5.41.209 2 u 56 64 17 0.335 3605497 26.136
drjegw2.drjn.co 192.5.41.41 2 u 56 64 17 19.241 3605534 39.621
drjeuro.drjn.eu 128.252.19.1 2 u 60 64 17 105.970 3605532 38.946
That’s some offset, eh? 3.605 x 10^6 msec, or, when you think about it, just over an hour. And yet the local clock had no offset. Strange.
Date
I like to do a crude check of system time by running the date command – quickly – on two different systems. Lacking some sleep, I noticed eventually but not right away, that my ntpd server had a date that was retarded by almost exactly an hour. I didn’t notice it at first because I had trained myself to only look at the seconds, which were “only” off by five seconds.
I checked to see if the timezone or localization settings had been changed by the patching – they hadn’t. So I went ahead and advanced the system clock by an hour. Actually the yast GUI of SLES gave me the option to sync against a time server, so I chose my closest one and did that after I had stopped ntpd.
Next problem, please
That got the time in the ballpark. But ntpd still wasn’t behaving. It exhibited a strange behaviour I’ve never seen before – its offset kept increasing. I observed this behaviour over the course of several minutes:
> ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 3 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 129 128 377 0.350 146.846 81.771
+drjegw2.drjn.co 192.5.41.41 2 u 1 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 72 128 377 104.931 161.696 79.561 |
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 3 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 129 128 377 0.350 146.846 81.771
+drjegw2.drjn.co 192.5.41.41 2 u 1 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 72 128 377 104.931 161.696 79.561
> ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 5 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 2 128 377 1.803 182.380 97.636
+drjegw2.drjn.co 192.5.41.41 2 u 3 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 74 128 377 104.931 161.696 79.561 |
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 5 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 2 128 377 1.803 182.380 97.636
+drjegw2.drjn.co 192.5.41.41 2 u 3 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 74 128 377 104.931 161.696 79.561
> ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 28 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 89 128 377 1.803 182.380 97.636
+drjegw2.drjn.co 192.5.41.41 2 u 90 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 32 128 377 104.667 197.864 96.296 |
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 28 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 89 128 377 1.803 182.380 97.636
+drjegw2.drjn.co 192.5.41.41 2 u 90 128 377 20.211 183.047 97.286
+drjeuro.drjn.eu 128.252.19.1 2 u 32 128 377 104.667 197.864 96.296
> ntpq -p
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 5 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 4 128 377 1.813 218.325 113.345
+drjegw2.drjn.co 128.118.25.5 2 u 5 128 377 19.667 219.077 113.157
+drjeuro.drjn.eu 128.252.19.1 2 u 75 128 377 104.667 197.864 96.296 |
remote refid st t when poll reach delay offset jitter
==============================================================================
LOCAL(0) .LOCL. 10 l 5 64 377 0.000 0.000 0.001
*drjegw.drjn.com 192.5.41.209 2 u 4 128 377 1.813 218.325 113.345
+drjegw2.drjn.co 128.118.25.5 2 u 5 128 377 19.667 219.077 113.157
+drjeuro.drjn.eu 128.252.19.1 2 u 75 128 377 104.667 197.864 96.296
Look at that offset column. See? It keeps going up, at about a rate of 40 msec every two minutes. It ain’t supposed to do that!
So a Unix pal of mine said he had encountered an issue in ntp and had commented out that local clock. I honestly had absolutely no idea what that LOCAL line did, but it had never hurt before.
The local clock comes from these lines in ntp.conf:
server 127.127.1.0 # local clock (LCL)
fudge 127.127.1.0 stratum 10 # LCL is unsynchronized |
server 127.127.1.0 # local clock (LCL)
fudge 127.127.1.0 stratum 10 # LCL is unsynchronized
So I took those out, stopped ntpd with a sudo service ntp stop, synced the time with a sudo sntp -P no -r drjegw.drjn.com, and restarted ntpd. It didn’t work immediately, but it became apparent eventually that it was working.
Meantime I discovered the ntpdc command, which is kind of informative in this situation:
> ntpdc
ntpdc> loopinfo
offset: 0.097373 s
frequency: -132.558 ppm
poll adjust: 12
watchdog timer: 841 s |
offset: 0.097373 s
frequency: -132.558 ppm
poll adjust: 12
watchdog timer: 841 s
This tells me the offset if 97 msec (already too large in my experience) and that for some reason the system clock hadn’t been adjusted in 841 s, and that the clock drift rate was -132 ppm – much, much higher than any other system
Then in a few minutes it clicked and got the offset in order:
ntpdc> loopinfo
offset: 0.000000 s
frequency: -132.558 ppm
poll adjust: 4
watchdog timer: 11 s |
offset: 0.000000 s
frequency: -132.558 ppm
poll adjust: 4
watchdog timer: 11 s
So removing the local system clock seemed to be working. But what was the real cause of all this? I discussed it with an admin. Bear in mind that this is physical server. He said the system clock gets its time from a hardware clock which should be visible in the ILO. We checked it. Sure enough, there it was, reporting in the ILO – still, after we had fixed the problem at the OS level – as one hour retarded. There was no way to manually adjust it. The only option was to set up sntp servers, which we did, which forced the ILO to restart.
We logged back in to the ILO and voila, the time was right!
I now realize that in the OS the LOCAL Time was using that hardware clock, which must have drifted by an hour since the system was installed.
Before the patch the system was incrementally keeping up with the drift, making the necessary incremental changes periodically. But the discrepancy was too large for it after rebooting after the patching. In the /var/log/ntp I even see a line:
14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time. |
14 Sep 07:20:53 ntpd[10259]: time correction of 3605 seconds exceeds sanity limit (1000); set clock manually to the correct UTC time.
Conclusion
Now the system is better and we have:
ntpdc> loopinfo
offset: 0.057029 s
frequency: -5.465 ppm
poll adjust: 4
watchdog timer: 24 s |
offset: 0.057029 s
frequency: -5.465 ppm
poll adjust: 4
watchdog timer: 24 s
That’s better, but the offset of 57 msec is still far larger than normal. But it’s useable for now.