30 thoughts on - Leap Second

  • My Oracle database running on CentOS 6 didn’t love it :-(

    Some Java processes were at >100% CPU after the leap second was added.

    Rebooting…

    Mogens

  • You could have just done:
    service ntpd stop; date -s "`date`"; service ntpd start
    Fixed here without even stopping any jvm.

  • From: bob

    To: CentOS mailing list
    Sent: Sunday, July 1, 2012 9:55 AM
    Subject: Re: [CentOS] leap second

    I had a VM crash, but it was on an old 2.4 kernel.

  • The interesting thing to me is that my c5 systems just kept on ticking but my c6 systems had the load go through the roof and fill the logs with things like the following:

    Jun 30 19:59:59 casper kernel: Clock: inserting leap second 23:59:60 UTC
    Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
    Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable
    Jun 30 19:59:59 casper tgtd: work_timer_evt_handler(89) failed to read from timerfd, Resource temporarily unavailable

    Regards,

  • Hi Mogens,

    Same problem here … OpenNMS was at 100% CPU and didn’t do anything anymore.

    Rebooting is not necessary, though. For me it worked to just set the time manually once, and everything was back to normal.

    It doesn’t strike me as a particularly good idea to insert a ‘:60’ second – software that does proper sanity checks on date/time values is supposed to barf on that.

    Peter.
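
As a small illustration of the point about sanity checks (a sketch in Python, not something anyone in this thread ran): Python's `datetime` type is one example of a date/time API that rejects a seconds value of 60 outright, while `calendar.timegm` just applies the POSIX arithmetic and maps 23:59:60 onto the same counter value as the following midnight. The helper name `accepts_leap_second_label` is made up for this example.

```python
import calendar
import datetime


def accepts_leap_second_label():
    """Return True if datetime accepts 2012-06-30 23:59:60, else False."""
    try:
        datetime.datetime(2012, 6, 30, 23, 59, 60)
    except ValueError:
        # datetime only allows seconds in the range 0..59.
        return False
    return True


# calendar.timegm does plain POSIX arithmetic on the tuple fields,
# so second 60 simply spills over into the next minute.
leap_label = calendar.timegm((2012, 6, 30, 23, 59, 60, 0, 0, 0))
next_midnight = calendar.timegm((2012, 7, 1, 0, 0, 0, 0, 0, 0))

print(accepts_leap_second_label())
print(leap_label == next_midnight)
```

So whether a ':60' time stamp is rejected or silently folded into the next minute depends entirely on which layer of the software gets to see it.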

  • I thought this was some sort of late April Fools’ joke, until I read the article about ntpd on Slashdot.

    My CentOS 5.8 box is running ntpd, and I did not notice any problems with it. I do a weekly yum update early Sunday mornings, but AFAIR I have not rebooted the box yet.

    Checking qps, it tells me the uptime is 4 days, 23 hours, 53 minutes.

    Kind Regards,

    Keith

  • I’m sort of curious about how a bug of this magnitude slips through the QA process (into Java and RHEL, not CentOS). With all the furor about Y2K, did no one even bother to simulate a leap second ahead of the real occurrence?

    I don’t think it affected 5.x.

  • Hi Keith,

    I did not have any problems on CentOS 5.8, but I did on one CentOS 6.2 box running a Java application.

    Kind Regards,

    Peter.

  • Interesting, but I thought that ntp clients always advanced the clock by small fractions of a second anyway even when the master source differs by more.

  • The Java one seemed to be a pretty sure thing. Was this just OpenJDK, or was the current Oracle version affected too?

  • I had problems with Firefox on four computers running fully updated CentOS 6. Firefox was suddenly taking up a lot of CPU power while showing nothing but a blank web page, on all four computers. Closing and re-opening Firefox didn’t fix it, logging out and back in didn’t fix it, but rebooting the machines did.

    Some Google searching indicates to me that there was a problem with Firefox using a futex that got confused by the leap second, and getting into a loop.

  • Hi Les,

    They do. But the leap second is quite a different thing: the time doesn’t actually diverge from the server’s, but the stratum 1 server delivers a totally unexpected 01:59:60, and the stratum 2 server follows.

    The Google approach is not to use that time at all, but to slow the clock down slightly on the stratum 2 server so that, by the end of the smear window, the stratum 1 server (which has the ‘genuine’ time and jumps to the :60 time stamp after :59) is about one second ahead of the stratum 2 server. So at approximately the instant when the stratum 1 server jumps from :59 to :60, the stratum 2 server jumps from :58 to :59, and at the next second tick both jump to 02:00:00 and are in sync again. The same approach would work with a negative leap second, although one has never been needed so far.

    The disadvantage of this method is that you have to know in advance when the leap second will happen, which requires tables that regularly have to be updated since it is fairly unpredictable in the long run when a leap second will be necessary. I don’t know why they didn’t simply use the ‘LI’ bit in the NTP protocol to determine when to start ‘smearing’ – at least the article doesn’t say they did:

    <http://www.networksorcery.com/enp/protocol/ntp.htm>

    Maybe 24 hours notification in advance did not seem long enough for the smear interval. I doubt it, because I would not really like the time to differ from the real time for more than a day.

    Best regards,

    Peter.
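
To make the smearing idea a bit more concrete, here is a minimal sketch in Python of how far a smeared clock would lag behind the unsmeared one as the leap second approaches. It assumes a simple linear ramp and a 20-hour window; both are assumptions picked for illustration, and the function name `smear_offset` is made up. The article may well describe a different curve and window.

```python
def smear_offset(seconds_before_leap, window=20 * 3600.0):
    """Seconds by which the smeared clock lags the unsmeared one.

    seconds_before_leap: time remaining until the leap second is inserted.
    window: hypothetical smear length in seconds (an assumption, not a
            figure taken from the thread).
    """
    if seconds_before_leap >= window:
        return 0.0  # smear has not started yet
    if seconds_before_leap <= 0:
        return 1.0  # the full second has been absorbed by the leap instant
    # Linear ramp from 0 to 1 second across the window.
    return 1.0 - seconds_before_leap / window


print(smear_offset(10 * 3600))
```

Halfway through the window the smeared clock is half a second behind, and by the time the unsmeared clock inserts its extra second the two agree again.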

  • Yeah, there are regulatory requirements in some industries that server clocks be within 100 ms of UTC (or, at least, that’s how internal audit have interpreted the regulations where I work).

    Allowing the clock to drift by a second would normally be bad, but I guess it wouldn’t matter on a non-business day. (Well, non-business for 95% of the company where this matters – some areas were still open.)

  • As far as I’ve been able to understand it, the problem had nothing to do with validity checks or other date handling code. The problem was simply a bug in the API provided by the Linux kernel for notification of leap seconds. The kernel messed up some internal data that led to futexes going nuts. The affected programs weren’t handling dates poorly, they were just threaded applications.

    Google’s approach was reliable by chance. They used a different kernel
    API to adjust the clock, and that one didn’t break futexes.

  • So it wasn’t anything special about Java? I did find one not-very-busy instance of a CentOS 6.x box with a Java application still running that did not appear to have a problem.

  • That’s not quite correct. The NTP protocol (as you mentioned later) actually indicates that the current day should include a leap second. The NTP server notifies the kernel of this, and the kernel inserts the leap second at the end of the day by extending the duration of one of the system clock’s seconds.

    The “60” second doesn’t exist in NTP or in the POSIX system clock, both of which are counters from their respective epochs. The “60” second is present only in time representations that are converted from the system clock or NTP clock.

    http://www.eecis.udel.edu/~mills/leap.html
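
For anyone curious what that indication looks like on the wire: the leap indicator (LI) is the top two bits of the first byte of the NTP packet header, followed by a three-bit version number and a three-bit mode. A minimal Python sketch of decoding that byte is below; the names `LEAP_INDICATOR` and `decode_first_byte` are made up for this example, and it only looks at a single byte rather than talking to a real server.

```python
LEAP_INDICATOR = {
    0: "no warning",
    1: "last minute of the day has 61 seconds",
    2: "last minute of the day has 59 seconds",
    3: "alarm condition (clock not synchronized)",
}


def decode_first_byte(first_byte):
    """Split the first byte of an NTP packet header into (LI, version, mode)."""
    li = (first_byte >> 6) & 0x3       # top two bits
    version = (first_byte >> 3) & 0x7  # next three bits
    mode = first_byte & 0x7            # low three bits
    return li, version, mode


# 0x64 = 0b01_100_100: LI=1 (insert a leap second), version 4, mode 4 (server).
li, version, mode = decode_first_byte(0x64)
print(LEAP_INDICATOR[li], version, mode)
```

Note that nothing in this field carries a ':60' time stamp; it is only a warning flag about the current day.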

  • Only that Java applications tend to be threaded, and threaded applications were the ones likely to be affected by the bug.

  • Sooo… Are the 6.x boxes that did not exhibit a problem yet still likely to have it if you start a threaded program or did it have to happen in the 1 second window?

  • As far as I know, it could still pop up. The futex handling in the kernel will be screwed up until the system reboots, or until the time is set using an API that wasn’t affected by the bug. That’s why one of the recommended fixes is just to:

    date -s "`date`"

  • Gordon Messmer wrote:

    Dumb question, but I haven’t followed this thread that closely – been busy at work – but why not
    $ service ntpd stop
    $ ntpdate
    $ service ntpd start
    ?
    mark

  • Today that might work, but would be slower than using “date”. On
    Saturday, I think that would have triggered the bug.

  • An event on an unpredictable schedule averaging 1.7 years since 1972
    doesn’t count as “scarce”?

    That’s the answer to Les’s outrage, too, by the way. Might as well expect the JRE to have code to deal with cosmic ray damage that gets by
    ECC, too.

  • Because that results in a call to adjtimex(2), which is also the syscall used by ntpd, which in turn is affected by the kernel bug.

    Calling date(1) instead uses the clock_settime(2) syscall, which isn’t affected.

    One isn’t implemented in terms of the other, for reasons that should be obvious from the manpages.
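
For illustration only, the same "set the clock to what it already is" trick can be sketched in Python, whose `time.clock_settime` wraps clock_settime(2) on Unix. This assumes, as described above, that going through clock_settime(2) rather than adjtimex(2) is what clears the condition; the function name `reset_clock_to_itself` and its `dry_run` flag are made up for this example, and actually setting the clock needs root.

```python
import time


def reset_clock_to_itself(dry_run=True):
    """Read CLOCK_REALTIME and write the same value straight back.

    With dry_run=True (the default) nothing is changed; the value that
    would be written is just returned. Actually setting the clock needs
    root privileges.
    """
    now = time.clock_gettime(time.CLOCK_REALTIME)
    if not dry_run:
        # Goes through clock_settime(2) rather than adjtimex(2).
        time.clock_settime(time.CLOCK_REALTIME, now)
    return now


print(reset_clock_to_itself())
```

Calling it with `dry_run=False` as root would be roughly the equivalent of the `date -s` one-liner, minus the rounding to whole seconds that comes from going through date's text output.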

  • “Unpredictable” means you don’t know something is coming in time to test for what to expect from its effect. I don’t see how that applies here.

    Well, if there were a well-known, long-standing API for that, and the time it was going to happen were announced months ahead, then yes, I would expect it to be tested too. But, per the earlier discussion, it is a kernel bug, not a JRE bug. I’d sort of expect Java builds to have unit tests for their APIs.

  • Would have loved to know that then ;-)

    We have two 8-node clusters that run many Java applications, plus many Java applications on separate servers. I went nuts when all the servers running Java went to 100% CPU at once!

    The guy I spoke to at Red Hat GSS at about 00:45 UTC basically told me to reboot the server, which ended up meaning rebooting our two clusters… Bad… But since computers all over the world crashed, our clients did understand that the problem lay at a far lower level than they had believed…
