CentOS 6.5 RHCS Fence Loops


Hi Guys,

I’m using CentOS 6.5 as a guest on RHEV, with RHCS for a clustered web environment. The environment:
web1.example.com
web2.example.com

When the cluster becomes quorate, web1 is rebooted by web2. When web2 comes up, web2 is rebooted by web1. Does anybody know how to solve this “fence loop”?
master_wins="1" is not working properly, and neither is qdisk. The cluster.conf is below; I re-created a “fresh” cluster, but the fence loop still exists.

[cluster.conf contents not preserved]

Log (/var/log/messages):

Oct 29 07:34:04 web2 corosync[1182]: [QUORUM] Members[1]: 1
Oct 29 07:34:08 web2 fenced[1242]: fence web3.cluster dev 0.0 agent fence_rhevm result: error from agent
Oct 29 07:34:08 web2 fenced[1242]: fence web3.cluster failed
Oct 29 07:34:12 web2 fenced[1242]: fence web3.cluster success
Oct 29 07:34:12 web2 clvmd: Cluster LVM daemon started - connected to CMAN
Oct 29 07:34:12 web2 rgmanager[1790]: I am node #1
Oct 29 07:34:12 web2 rgmanager[1790]: Resource Group Manager Starting

Thanks

5 thoughts on - CentOS 6.5 RHCS Fence Loops

  • Hi,

    > Does anybody know how to solve this “fence loop”?

    The logs shared are not sufficient to identify the cause of the fence loop. I would suggest that you:

    1. Disable cman (chkconfig cman off), and rgmanager too if you wish, on both nodes.
    2. Reboot both nodes.
    3. Once the machines are up, open two terminals.
    4. Start cman manually on both nodes.
    5. Share the behaviour and the logs generated.

    Cheers,

  • Hello Dominic,

    Thanks for the response.

    When I start cman manually, web3 is fenced by web2. Here are the logs from web2 (/var/log/messages):

    Oct 29 13:15:25 web2 corosync[1493]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
    Oct 29 13:15:25 web2 corosync[1493]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
    Oct 29 13:15:25 web2 corosync[1493]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
    Oct 29 13:15:25 web2 corosync[1493]: [MAIN ] Successfully parsed cman config
    Oct 29 13:15:25 web2 corosync[1493]: [TOTEM ] Initializing transport (UDP/IP Multicast).
    Oct 29 13:15:25 web2 corosync[1493]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
    Oct 29 13:15:26 web2 corosync[1493]: [TOTEM ] The network interface [10.32.6.153] is now up.
    Oct 29 13:15:26 web2 corosync[1493]: [QUORUM] Using quorum provider quorum_cman
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
    Oct 29 13:15:26 web2 corosync[1493]: [CMAN ] CMAN 3.0.12.1 (built Sep 25 2014 15:07:47) started
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync configuration service
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync profile loading service
    Oct 29 13:15:26 web2 corosync[1493]: [QUORUM] Using quorum provider quorum_cman
    Oct 29 13:15:26 web2 corosync[1493]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
    Oct 29 13:15:26 web2 corosync[1493]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
    Oct 29 13:15:26 web2 corosync[1493]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Oct 29 13:15:26 web2 corosync[1493]: [CMAN ] quorum regained, resuming activity
    Oct 29 13:15:26 web2 corosync[1493]: [QUORUM] This node is within the primary component and will provide service.
    Oct 29 13:15:26 web2 corosync[1493]: [QUORUM] Members[1]: 1
    Oct 29 13:15:26 web2 corosync[1493]: [QUORUM] Members[1]: 1
    Oct 29 13:15:26 web2 corosync[1493]: [CPG ] chosen downlist: sender r(0) ip(10.32.6.153) ; members(old:0 left:0)
    Oct 29 13:15:26 web2 corosync[1493]: [MAIN ] Completed service synchronization, ready to provide service.
    Oct 29 13:15:30 web2 fenced[1548]: fenced 3.0.12.1 started
    Oct 29 13:15:30 web2 dlm_controld[1568]: dlm_controld 3.0.12.1 started
    Oct 29 13:15:30 web2 gfs_controld[1621]: gfs_controld 3.0.12.1 started
    Oct 29 13:16:21 web2 fenced[1548]: fencing node web3.cluster
    Oct 29 13:16:24 web2 fenced[1548]: fence web3.cluster dev 0.0 agent fence_rhevm result: error from agent
    Oct 29 13:16:24 web2 fenced[1548]: fence web3.cluster failed
    Oct 29 13:16:27 web2 fenced[1548]: fencing node web3.cluster
    Oct 29 13:16:29 web2 fenced[1548]: fence web3.cluster success

  • In 2-node clusters, never allow cman or rgmanager to start on boot. A node reboots for one of two reasons: it was fenced, or it is undergoing scheduled maintenance. In the former case, you want to review it before restoring it. In the latter case, a human is already there to start it. This is good advice for clusters of three or more nodes as well.

    As an aside, the default timeout to wait for the peer on start is 6
    seconds, which I find to be too short. I up it to 30 seconds with:
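
    The snippet that belonged here did not survive in this copy of the thread. As a minimal sketch, assuming the timeout meant is the post_join_delay attribute of the fence_daemon element in /etc/cluster/cluster.conf, the change could look like this (the cluster name and config_version are placeholders, not the poster's values):

    ```xml
    <cluster name="example-cluster" config_version="2">
      <!-- Seconds fenced waits after a node joins the fence domain before
           fencing peers it has not seen (the default is 6). -->
      <fence_daemon post_join_delay="30"/>
      <!-- clusternodes, fencedevices and the rest of the config stay unchanged -->
    </cluster>
    ```

    config_version has to be incremented each time cluster.conf is edited so that the nodes accept the new version.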

    As for the fence-on-start, it could be a network issue. Have you tried unicast instead of multicast? Try this:
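
    The suggested config is missing from this copy of the thread as well. A sketch of what it could look like, assuming a cman version that supports the transport attribute: set transport="udpu" on the cman element. The two_node and expected_votes values shown are the usual settings for a 2-node cluster and are assumed here, not taken from the poster's file.

    ```xml
    <!-- Use UDP unicast (udpu) instead of the default multicast transport -->
    <cman two_node="1" expected_votes="1" transport="udpu"/>
    ```

    When the setting takes effect, corosync logs "Initializing transport (UDP/IP Unicast)" at startup instead of "(UDP/IP Multicast)".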

    Slight comment:

    > When the cluster becomes quorate,

    Nodes are always quorate in 2-node clusters.

    digimer

  • It didn’t see the other node on boot, gave up and fenced the peer, it seems. The fence call failed before it succeeded, another sign of a general network issue.

    As an aside, did you configure corosync.conf? If so, don’t. Let cman handle everything.

    Are you starting cman on both nodes at (close to) exactly the same time?

  • Hello Digimer,

    I have already configured cluster.conf as you advised, but when I start cman manually on web3 (cman was already stopped), web2 is fenced by web3. Here is the log on web3:

    Oct 29 14:38:42 web3 ricci[2557]: Executing '/usr/bin/virsh nodeinfo'
    Oct 29 14:38:42 web3 ricci[2559]: Executing '/usr/libexec/ricci/ricci-worker -f /var/lib/ricci/queue/1604501608'
    Oct 29 14:38:42 web3 modcluster: Updating cluster.conf
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Corosync Cluster Engine ('1.4.1'): started and ready to provide service.
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Corosync built-in features: nss dbus rdma snmp
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Successfully read config from /etc/cluster/cluster.conf
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Successfully parsed cman config
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] Initializing transport (UDP/IP Unicast).
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] Initializing transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] The network interface [10.32.6.194] is now up.
    Oct 29 14:39:05 web3 corosync[2651]: [QUORUM] Using quorum provider quorum_cman
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
    Oct 29 14:39:05 web3 corosync[2651]: [CMAN ] CMAN 3.0.12.1 (built Sep 25 2014 15:07:47) started
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync CMAN membership service 2.90
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: openais checkpoint service B.01.01
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync extended virtual synchrony service
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync configuration service
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync cluster closed process group service v1.01
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync cluster config database access v1.01
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync profile loading service
    Oct 29 14:39:05 web3 corosync[2651]: [QUORUM] Using quorum provider quorum_cman
    Oct 29 14:39:05 web3 corosync[2651]: [SERV ] Service engine loaded: corosync cluster quorum service v0.1
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Compatibility mode set to whitetank. Using V1 and V2 of the synchronization engine.
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] adding new UDPU member {10.32.6.153}
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] adding new UDPU member {10.32.6.194}
    Oct 29 14:39:05 web3 corosync[2651]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Oct 29 14:39:05 web3 corosync[2651]: [CMAN ] quorum regained, resuming activity
    Oct 29 14:39:05 web3 corosync[2651]: [QUORUM] This node is within the primary component and will provide service.
    Oct 29 14:39:05 web3 corosync[2651]: [QUORUM] Members[1]: 2
    Oct 29 14:39:05 web3 corosync[2651]: [QUORUM] Members[1]: 2
    Oct 29 14:39:05 web3 corosync[2651]: [CPG ] chosen downlist: sender r(0) ip(10.32.6.194) ; members(old:0 left:0)
    Oct 29 14:39:05 web3 corosync[2651]: [MAIN ] Completed service synchronization, ready to provide service.
    Oct 29 14:39:09 web3 fenced[2708]: fenced 3.0.12.1 started
    Oct 29 14:39:09 web3 dlm_controld[2734]: dlm_controld 3.0.12.1 started
    Oct 29 14:39:09 web3 gfs_controld[2781]: gfs_controld 3.0.12.1 started
    Oct 29 14:40:24 web3 fenced[2708]: fencing node web2.cluster
    Oct 29 14:40:29 web3 fenced[2708]: fence web2.cluster success

    I did not configure corosync.conf. Here is cluster.conf:

    [cluster.conf contents not preserved]

    Thanks