Keepalived – Spurious Failovers


Hello,

We are using CentOS 6.6 and keepalived 1.2.13 on two servers for failover only, with no load balancing. Failover is governed by the NIC being present and by the Apache and Tomcat processes being present. Both servers are configured as ‘EQUAL’ (not master/backup). An initial priority of 100 is set, and if a process or the NIC fails, this is reduced by 60, so the peer sees a lower priority and failover takes place. Generally this works well. If we stop the network or one of the processes, this is logged (to /var/log/messages) and failover happens within a few seconds.
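
To make the arithmetic concrete, here is a minimal Python sketch of the priority scheme described above. It is only an illustration, not keepalived's code: the node names and check names are hypothetical, it assumes each failed check subtracts the full 60, and it does not model how VRRP breaks ties between equal priorities.

```python
def effective_priority(base, weight, check_results):
    """Return `base` adjusted by `weight` once for every failed check."""
    failed = sum(1 for ok in check_results.values() if not ok)
    # Clamp to 1..254, the range VRRP allows for a router that does not own the address.
    return max(1, min(254, base + weight * failed))


def elect_master(priorities):
    """Return the node advertising the highest priority (ties are not modeled)."""
    return max(priorities, key=priorities.get)


BASE, WEIGHT = 100, -60  # values taken from the description above

healthy = {"nic": True, "apache": True, "tomcat": True}
apache_down = {"nic": True, "apache": False, "tomcat": True}

priorities = {
    "node_a": effective_priority(BASE, WEIGHT, apache_down),  # 100 - 60 = 40
    "node_b": effective_priority(BASE, WEIGHT, healthy),      # stays at 100
}
print(priorities, "->", elect_master(priorities))
```

With one failed check the affected node advertises 40 against the peer's 100, which is why a single process failure is enough to move the virtual address.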

However, we have had failovers occur during the night several times. It happened last night, and the night before. Nothing was logged in the messages file about the NIC being down, or the Apache/Tomcat processes being unavailable. Nothing was logged by the Apache or Tomcat processes in their own log files. The failovers have happened at 03:56 on both nights.

The most obvious suspect would be some nighttime process such as log rotation or automatic updates. However, I can see nothing obvious occurring during the night that would cause the keepalived virtual interface to fail over.

The messages log file typically shows:

On the previous master, now slave server…
==========================
Nov 12 03:56:40 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Transition to MASTER STATE
Nov 12 03:56:43 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Entering MASTER STATE
Nov 12 03:56:43 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) setting protocol VIPs.
Nov 12 03:56:43 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Sending gratuitous ARPs on eth0 for xxx.xxx.xxx.xxx
Nov 12 03:56:48 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Sending gratuitous ARPs on eth0 for xxx.xxx.xxx.xxx
Nov 12 03:56:51 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Received higher prio advert
Nov 12 03:56:51 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Entering BACKUP STATE
Nov 12 03:56:51 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) removing protocol VIPs.
=========================
On the previous slave, now master server, there is nothing logged at (or around) this time at all.

As the previous master's log shows, it ‘Received higher prio advert’. That implies the priority on this server was lower than its peer's, but there is no indication why.
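
When comparing several nights, it can help to pull the state transitions out of /var/log/messages mechanically. The following is a rough Python sketch, assuming the syslog line layout seen in the excerpt above. The year is an assumption, because syslog timestamps do not carry one, and the function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines such as:
# Nov 12 03:56:43 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Entering MASTER STATE
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"Keepalived_vrrp\[\d+\]:\s+VRRP_Instance\((?P<inst>[^)]+)\)\s+(?P<msg>.*)$"
)


def parse_events(lines, year=2014):
    """Return (timestamp, instance, message) tuples for keepalived VRRP lines."""
    events = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # not a keepalived VRRP instance line
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        events.append((ts, m["inst"], m["msg"]))
    return events


def master_hold_seconds(events):
    """Seconds between the first 'Entering MASTER' and the next 'Entering BACKUP'."""
    entered = None
    for ts, _inst, msg in events:
        if msg.startswith("Entering MASTER STATE") and entered is None:
            entered = ts
        elif msg.startswith("Entering BACKUP STATE") and entered is not None:
            return (ts - entered).total_seconds()
    return None


sample = [
    "Nov 12 03:56:43 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Entering MASTER STATE",
    "Nov 12 03:56:51 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Received higher prio advert",
    "Nov 12 03:56:51 bill Keepalived_vrrp[27279]: VRRP_Instance(Shib_srvrs) Entering BACKUP STATE",
]
events = parse_events(sample)
print(master_hold_seconds(events))  # 8.0 for the excerpt above
```

For the logged event this gives 8 seconds between entering MASTER (03:56:43) and dropping back to BACKUP (03:56:51).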

Has anyone seen this themselves? Or have any idea why it may be occurring? As I said, some nighttime process seems to be the cause, but I cannot think of or find anything that would cause it.

Thanks,

John.

5 thoughts on - Keepalived – Spurious Failovers

  • Yes. Nothing obvious that would cause a problem to apache/tomcat or the network.

    They are both virtual servers – so no UPS. Failover communication is over the network.

    John.

  • +1 to your logrotate thought; I’d dig deeper there.

    Check /var/lib/logrotate.status and see whether the rotation dates match up with the days the failover happens, for example whether different httpd logs are rotating on those days.
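
One way to do that comparison is a small Python sketch like the one below. It assumes the status file holds one quoted path followed by a date per line (some logrotate versions append a time, which is ignored here). The sample entries are invented purely for illustration and are not taken from the poster's servers.

```python
import re
from datetime import date

# Assumed entry layout: "/path/to/log" YYYY-M-D  (an optional -H:M:S suffix is ignored)
ENTRY_RE = re.compile(r'^"(?P<path>[^"]+)"\s+(?P<y>\d{4})-(?P<m>\d{1,2})-(?P<d>\d{1,2})')


def parse_logrotate_status(text):
    """Map each log path to the date logrotate last rotated it."""
    rotated = {}
    for line in text.splitlines():
        m = ENTRY_RE.match(line.strip())
        if m:  # the header line and blank lines simply do not match
            rotated[m["path"]] = date(int(m["y"]), int(m["m"]), int(m["d"]))
    return rotated


def rotated_on(rotated, day):
    """Return the sorted list of paths last rotated on `day`."""
    return sorted(path for path, d in rotated.items() if d == day)


# Made-up sample content; on a real host, read /var/lib/logrotate.status instead.
sample = '''logrotate state -- version 2
"/var/log/httpd/access_log" 2014-11-9
"/var/log/messages" 2014-11-12
'''
status = parse_logrotate_status(sample)
print(rotated_on(status, date(2014, 11, 12)))
```

Note that the file only records the most recent rotation of each log, so it has to be checked soon after a failover night to be useful.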


  • No, no snapshots are taken. As I said, this is a spurious event which has happened at 03:56 for the past two nights. However, we ran for a few days before then with no problems. Before that it happened at something like 5AM. It does not happen every night, nor (usually) at the same time.

    I have set up a couple of cronjobs to check ifconfig and ping the interface every few seconds. I also have a job that will monitor the main interface for VRRP traffic since that should show what the priority value is when a server claims to have received a higher priority from another server.

    John.
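
For the VRRP capture mentioned above, the advertised priority sits in the third byte of the advertisement. Below is a minimal Python sketch for decoding the fixed header, assuming the VRRPv2 layout from RFC 3768 and that `payload` is the IP payload (protocol 112) with the IP header already stripped. The sample bytes, including the VRID of 51, are invented, and the checksum is not verified.

```python
import struct
from collections import namedtuple

VrrpAdvert = namedtuple(
    "VrrpAdvert",
    "version type vrid priority addr_count auth_type advert_int checksum",
)


def parse_vrrp_v2(payload):
    """Decode the 8-byte fixed header of a VRRPv2 advertisement."""
    if len(payload) < 8:
        raise ValueError("VRRP payload shorter than the 8-byte fixed header")
    ver_type, vrid, prio, count, auth, adv, csum = struct.unpack("!BBBBBBH", payload[:8])
    # The first byte packs the version (high nibble) and the type (low nibble).
    return VrrpAdvert(ver_type >> 4, ver_type & 0x0F, vrid, prio, count, auth, adv, csum)


# Invented example: version 2, type 1 (advertisement), VRID 51, priority 40.
sample = bytes([0x21, 51, 40, 1, 0, 1, 0x00, 0x00])
advert = parse_vrrp_v2(sample)
print(advert.vrid, advert.priority)
```

Logging the priority from every advertisement with a timestamp would show whether either server really advertised 40 around the time of a failover.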

  • Given that failover only occurs if Apache, Tomcat or the NIC fail, I can’t find anything in log rotation that could cause this effect. For failover to occur the Apache/Tomcat process must be non-existent (in our case keepalived checks for them using pgrep). We have secondary monitoring of these processes (Xymon using checks of ‘ps’), and that shows no such failure. Simply logging into the servers and running ps shows that they are running. I would hope that something would be logged by either process in the appropriate log file, but nothing is seen. Of course it could be something dire that simply kills the process dead, but again we do not see that at all (ps shows they are present).

    So that leaves the NIC. Again, I cannot think of any process (day or night) that would cause the NIC to fail (or restart); that would be a serious problem. Secondly, keepalived should log the fact and put itself into a FAULT state. I tested this on a test server, and it worked as described. We, however, see no such fault state or log messages on our live servers.

    So, I am very much stumped as to the problem. I’m hoping that if keepalived fails over tonight, then the cron jobs I have set up may give a clue.

    John.
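
Since the pgrep checks themselves log nothing, one option is to wrap each check so that any non-zero exit leaves a timestamped trace. This is a hedged Python sketch of that idea, not something from the poster's setup: the function name and the log path in the comment are hypothetical.

```python
import subprocess
import time


def run_check(cmd, log_path):
    """Run `cmd`; on a non-zero exit, append a timestamped line to `log_path`.

    Returns the command's exit code, so the caller can pass it straight on.
    """
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    if result.returncode != 0:
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        with open(log_path, "a") as fh:
            fh.write(f"{stamp} check failed rc={result.returncode}: {' '.join(cmd)}\n")
    return result.returncode


# Hypothetical use from a check script (the log path is an assumption):
#   import sys
#   sys.exit(run_check(["pgrep", "httpd"], "/var/log/keepalived-checks.log"))
```

If a check only fails briefly at 03:56, the wrapper's log would record it even though ps looks normal a few minutes later.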