Leap Second And CentOS

Hi.

We have another leap second coming. Have past bugs with CentOS and leap seconds (specifically high CPU spikes) been resolved? Should we be worried?

-G

28 thoughts on - Leap Second And CentOS

  • It seems to boil down to: https://bugzilla.redhat.com/show_bug.cgi?id=479765 which is closed as fixed. The private bugs are “pay attention” tickets, but some do reference the KB article “Leap Seconds in Red Hat Enterprise Linux”.

    A summary of Leap Seconds in Red Hat Enterprise Linux – https://access.redhat.com/articles/15145

    Issue:

    6 different ways of saying “Will my system work?”.

    Environment:

    EL 4-7

    Resolution:

    For EL6 see:
    * https://access.redhat.com/knowledge/solutions/154713
    * https://access.redhat.com/knowledge/solutions/154793
    * https://access.redhat.com/knowledge/solutions/173693
    * https://access.redhat.com/knowledge/articles/199563

    Otherwise, if you run NTP, see Resolution A; if not, see Resolution B.

    Resolution A:

    NTP logging may crash EL 4/5, update your system.

    EL4: see http://rhn.redhat.com/errata/RHSA-2009-1024.html
    EL5: see http://rhn.redhat.com/errata/RHSA-2009-1243.html
    EL6/7: not affected, but for EL6 see https://access.redhat.com/site/solutions/154793 (CPU usage sucks after leap second*)

    [*:side bar: see http://rhn.redhat.com/errata/RHBA-2012-1199.html for the patch or do something like “date $(date +someformatthatworks)”]

    PPC and IA64 arches will self destruct and should not use NTP

    Resolution B:

    Your time will be wrong and you should be happy. A new tzdata will come out; see bugs:
    EL4: https://bugzilla.redhat.com/show_bug.cgi?id=1181975
    EL5: https://bugzilla.redhat.com/show_bug.cgi?id=1181933
    EL6: https://bugzilla.redhat.com/show_bug.cgi?id=1180536
    EL7: https://bugzilla.redhat.com/show_bug.cgi?id=1181970

    Root Cause:

    https://what-if.xkcd.com/26/

  • […lots of stuff…]

    Can you consolidate this to:
    ‘if you have updated your kernel and rebooted later than Sept. 2012
    you should have the fix’?


  • Yes; I thought that was assumed (or obvious), when this whole topic came up again.

    But I was responding to the “cannot see” comment, so I read it all and posted a 1st grade book report on it. :)

    -Jason

  • Fascinating – describes what’s happening but no mention of how we can rest assured that all will be well…. As I ponder it, I recognise that most of our systems are constantly calculating date/time values based upon the epoch – the number of seconds since a particular date/time. All these calculations need to be cognisant of these leap seconds, so it’s not just the ntp daemon, although that will be most immediately impacted; the effects of this need to be enshrined in code algorithms forever (well, a very long time).

  • The overall time calculations weren’t really the issue last time around. The problem was with sub-second sleeps and the thread scheduler being confused and spinning when ntp inserted an extra second in the clock. Any other way of resetting the clock fixed it (e.g. date -s "`date`"). It was a kernel bug and is theoretically fixed now.

    But I agree that those open bugs on the tzdata package aren’t all that helpful except to show that someone is thinking about it.

  • Unix and ntp handle leap seconds a bit differently. Unix time increases during the leap second and drops back a second after. Ntp freezes time during the leap second. OS kernels may do either or neither.

  • Does anyone have a succinct summary of how to prove to management-types that a given linux box won’t have a problem with the leap second? Like kernel > some_version, tzdata > some_version, tzdata-java > some_version?

  • Once upon a time, Les Mikesell said:

    Only way to “prove” it is to set up a test and try it. AFAIK there are no known issues with an up-to-date system, but that was also true at the last couple of leap seconds (the issues that happened were previously unknown).

    There are a couple of ways to test:

    – If you don’t need to “prove” NTP goodness, you can set up a
    free-running system with no NTP client, set the time to just before
    the leap second, and then use the adjtimex command (looks like this
    isn’t in RHEL/CentOS/EPEL so you would need to build it, like from the
    Fedora package) to set the leap flag. Then just watch your system
    through the leap second.

    – If you also need to “prove” NTP, you’ll have to set up a second system
    to be your NTP server. Set it to local mode with no outside servers,
    add the current leapseconds file, and set its clock to a little
    before the leap second. Sync your test server to that clock, then
    wait for the leap second.

    The issue (from IIRC 2009?) I ran into with a leap second only happened when the kernel was under load (race condition on console lock when printing the “leap second added” message). The most recent leap second issue had to do with timers not triggering in the expected way (can’t remember if that was kernel, or just applications/libraries not handling a kernel change).
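
For the free-running test described above, the first step is picking a start time shortly before the leap second. A minimal sketch, assuming GNU date and the 2015-06-30 leap second discussed in this thread; `pre_leap_time` and `LEAP_MIDNIGHT` are hypothetical names, and the script only prints the command rather than changing the clock:

```shell
#!/bin/sh
# Assumes GNU date (coreutils), which accepts "-d @SECONDS".

# 2015-07-01 00:00:00 UTC as POSIX seconds: the second right after the leap.
LEAP_MIDNIGHT=1435708800

# pre_leap_time LEAD_SECONDS
# Print the UTC wall-clock time LEAD_SECONDS before LEAP_MIDNIGHT.
pre_leap_time() {
    date -u -d "@$(( LEAP_MIDNIGHT - $1 ))" '+%Y-%m-%d %H:%M:%S'
}

start=$(pre_leap_time 300)

# Print, rather than run, the clock-setting command so it can be reviewed.
echo "On an isolated test box, as root, with no NTP client running:"
echo "  date -u -s '$start'"
# Next step (not scripted here): set the kernel's leap-insert flag with the
# adjtimex tool (check its man page for the status option), then watch the
# system through 23:59:60.
```

Starting five minutes early leaves time to set the leap flag and put the machine under whatever load you want to test.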

  • I don’t think I need to ‘prove’ that computer programs do repeatable things. I just want to know the version numbers that need to be installed – something relatively easy to check.

    Yeah, but you probably would have said that before the 2012 instance too… And what I really want to know is how ‘out-of-date’ a system can be.

    Now we know the issues, and hopefully someone has done the simulation tests. I just want to know the specific kernel and package versions that have the fixes. But none of the links I’ve found discussing the issues boil it down to something a non-geek would want to see.

  • Les Mikesell wrote:

    Two other thoughts: first, that it worked perfectly fine the last leap second, and second, that ntpd, according to the manpage, can and will adjust for seconds of difference with no problem at all, since that’s its job.

    mark

  • Errr, no. It did _not_ work fine in the last leap second. If you run threaded applications (including, but not exclusively, java) or applications that called usleep, the kernel would spin with 100% CPU use until you reset the date by some means other than ntp. How could you have missed that?
    http://www.wired.com/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/

    Every other sysadmin in the world got calls in the middle of the night to fix their servers.

  • No, it was _not_ java that failed. The kernel was spinning instead of scheduling threads. Any threaded application would have triggered the kernel bug – or a usleep() call from a non-threaded application. By the time I got the call I was able to google the fix about resetting the date, but the guys who manage some SuSE systems started earlier and ended up rebooting some of them – and they don’t run java applications.

  • Once upon a time, Les Mikesell said:

    No, we know the issue that broke last time (2012), and a different issue that broke the time before that (2008) (they were different problems). We don’t know any issues that may happen this time, unless you think no bugs have been introduced since the last leap second (obviously hindsight tells us there were between 2008 and 2012).

    Before the 2012 leap second, I ran tests to make sure the 2008 issue had been fixed, and it had. However, apparently nobody else ran their current setups through tests (maybe also hoping somebody else had done it), so there was a new issue. I haven’t actually checked to see that the 2008 issue has remained fixed (it should have, since the code had been changed to move away from that lock altogether). My setup wasn’t hit by the 2012 issue, so I don’t have a simple test for that.

    So again, if you want to make sure there’s no new issue, you’ll have to set up a test yourself. I doubt the 2008 or 2012 issues will happen again, but there’s plenty of room for new issues.

  • So are you saying that you think no one upstream has done any testing yet? Or that I should have better resources for testing than they do?
    I was hoping things weren’t really that bad and that I just hadn’t found the simple summary of results yet.

  • Once upon a time, Les Mikesell said:

    Like I said, probably someone that had an issue in 2012 has tested for the 2012 issue, so that probably won’t re-occur. But that doesn’t mean that someone has tested every piece of software in every combination in use.

    Again, using the 2012 leap second as an example, I (and I expect others) had experienced an issue in 2008, so I ran tests for that issue. I didn’t even think about thread scheduling being a problem (and my servers weren’t hit by that anyway), so I didn’t test for that, nor did I do a “full up” test like I described initially.

    So, it is possible that everything will be fine (there’s been more attention to leap second cases after the 2012 issue had wider impact than the 2008 issue). It is also possible that some _new_ type of issue has been introduced in the last 2.5 years that won’t appear until this leap second, but if nobody tests for it, we won’t know until the clock ticks 2015-06-30 23:59:60.

    Short answer: last time it was threaded stuff like Java, the time before it was systems under heavy kernel loads. Who knows, this time Postfix could hang, or MySQL could corrupt databases, or something else. Probably nothing will happen, but if you want a “cover your ass” report, I don’t think anybody has done that.

  • I’m not looking for a research project on how to prove that the last bug has been found or not. And I’m not particularly concerned about application-level bugs. Every time a second rolls over we take a chance of hitting a new previously unknown bug. We’re all taking that chance.

    I just want the package revisions for at least the kernel and tzdata* files and anything else where previously-found bugs related to the leap second have been fixed. What I want to know (and be able to describe concisely to a non-geek person) is whether, on a particular machine, the known/expected bugs have been fixed, or whether they haven’t and we need to schedule a reboot. And it seems like something everyone else using a distribution would want to know as well, at least for machines where scheduling a reboot is non-trivial.

  • https://access.redhat.com/articles/15145
    https://rhn.redhat.com/errata/RHSA-2013-0496.html

    Contrary to your previous assertion, in 2012, it was not the kernel that consumed CPU cycles. That problem was seen in user space. The problem was fixed by changing the kernel’s implementation of leap second handling, but the reason that you are being told that testing your applications is the only way to verify that there is not a problem is that these problems aren’t confined to the kernel and tzdata packages.

  • Once upon a time, Les Mikesell said:

    Basically, POSIX time doesn’t really handle leap seconds. In theory, the broken-down time struct (struct tm) can count to 60 (even 61) seconds in a minute.

    However, the base time_t is specified as days of exactly 86,400 seconds. The Linux kernel (and IIRC most other Unix systems) just ticks the same second twice; this June, the time() function will return 1435708799 for two seconds on the wall clock, and gettimeofday() will count tv_usec from 0 to 999999, then back to 0, without changing tv_sec.

    So, there’s a hack for things that really want to know leap seconds. It is done in the timezone data files; they know the offset from POSIX to UTC (based on all the leap seconds inserted since the start of the POSIX epoch, 1970-01-01) and report time that way.

    If your kernel never handled leap seconds, and was set to UTC seconds since 1970-01-01 instead of POSIX seconds, then you could use the “right” timezone files to see the current time. However, you’d be out of step with all the rest of the Internet for anything that uses POSIX seconds (fileservers for example), and always think the clock was slow (plus you’d have to run a custom copy of NTP to not try to “fix” the clock).
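
The arithmetic in the post above can be checked directly. A minimal sketch, assuming GNU date; `epoch_to_utc` is a hypothetical helper name:

```shell
#!/bin/sh
# Assumes GNU date (coreutils), which accepts "-d @SECONDS".

# epoch_to_utc SECONDS
# Print a POSIX timestamp as a UTC wall-clock time.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

last_second=1435708799                    # the value quoted in the post above

epoch_to_utc "$last_second"               # 2015-06-30 23:59:59
epoch_to_utc "$(( last_second + 1 ))"     # 2015-07-01 00:00:00

# POSIX days are exactly 86400 seconds long, so UTC midnight is always a
# whole multiple of 86400 and 23:59:60 has no timestamp of its own.
echo "$(( (last_second + 1) % 86400 ))"   # 0
```

Since consecutive time_t values map to 23:59:59 and 00:00:00, the kernel has to repeat (or otherwise stretch) a second to fit the leap second in.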

  • Still way tl;dnr material. Doesn’t anyone have a list of the oldest kernel version for each CentOS version you could be running and still avoid known problems?

  • The best answer to your question is “the latest version”, since previous versions all have known issues of one kind or another.

    It’s not a great idea to run outdated CentOS systems with known bugs of any kind.

  • I can’t argue with that (then again, you were running that buggy code before and happy with it), but having to reboot frequently is not ideal either, particularly on machines where scheduling downtime is a fairly involved process. I’m looking for the compromise with the least pain involved.

  • Hi Les,

    https://access.redhat.com/labs/leapsecond/leap_vulnerability.sh

    If you don’t have a subscription then the key bits from the script are:
    # RHEL 4 needs to be after -89
    # RHEL 5 needs to be after -164
    # RHEL 6 Affected Versions
    # 6 GA: All Versions
    # 6.1: Versions before -131.30.2
    # 6.2: Versions before -220.25.1
    # 6.3: Versions before -279.5.2

    and that the tzdata should be from 2015

    Tris
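
The EL6 thresholds above can be turned into a quick check against `uname -r`. This is a rough sketch, not the Red Hat script: the function names are hypothetical, it assumes GNU `sort -V`, it assumes every 6 GA build number is below -131, it only covers EL6, and it does not look at tzdata.

```shell
#!/bin/sh
# Assumes GNU sort (coreutils) for "sort -V" version ordering.

# ver_lt A B: succeed if dotted version A sorts strictly before B.
ver_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kernel_build RELEASE: "2.6.32-279.5.2.el6.x86_64" -> "279.5.2"
kernel_build() {
    build=${1#*-}                     # drop the "2.6.32-" prefix
    printf '%s\n' "${build%%.el*}"    # drop ".el6.x86_64" and similar
}

# el6_leap_status RELEASE: print "affected", "ok", or "unknown".
el6_leap_status() {
    case "$1" in
        *.el6*) ;;
        *) echo unknown; return ;;    # not an EL6 kernel string
    esac
    b=$(kernel_build "$1")
    series=${b%%.*}                   # 71, 131, 220, 279, ...
    case "$series" in
        ''|*[!0-9]*) echo unknown; return ;;
    esac
    if [ "$series" -lt 131 ]; then
        echo affected                 # 6 GA: all versions (assumption: builds below -131)
    elif [ "$series" -eq 131 ] && ver_lt "$b" 131.30.2; then
        echo affected                 # 6.1: versions before -131.30.2
    elif [ "$series" -eq 220 ] && ver_lt "$b" 220.25.1; then
        echo affected                 # 6.2: versions before -220.25.1
    elif [ "$series" -eq 279 ] && ver_lt "$b" 279.5.2; then
        echo affected                 # 6.3: versions before -279.5.2
    else
        echo ok
    fi
}

# Check the running kernel.
echo "$(uname -r): $(el6_leap_status "$(uname -r)")"
```

Version-aware sorting matters here: a plain string comparison would wrongly place -279.11.1 before -279.5.2.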

  • Thank you. That may save dealing with at least a few change request forms and scheduling procedures.