KVM HA



Hi,

I have two KVM hosts (CentOS 7) and would like them to operate as High Availability servers, automatically migrating guests when one of the hosts goes down.

My question is: Is this even possible? All the documentation for HA that I’ve found appears not to cover this. Am I missing something?

My configuration so far includes:

* SAN storage volumes for raw device mappings for guest VMs (single volume per guest)
* multipathing of iSCSI and InfiniBand paths to raw devices
* live migration of guests works
* a cluster configuration (pcs, corosync, pacemaker)

Currently when I migrate a guest, I can all too easily start it up on both hosts! There must be some way to fence these off but I’m just not sure how to do this.

Any help is appreciated.

Kind regards, Tom

22 thoughts on - KVM HA

  • Very possible. It’s all I’ve done for years now.

    https://alteeve.ca/w/AN!Cluster_Tutorial_2

    That’s for EL 6, but the basic concepts port perfectly. In EL7, just change out cman + rgmanager for pacemaker. The commands change, but the concepts don’t. Also, we use DRBD but you can conceptually swap that for
    “SAN” and the logic is the same (though I would argue that a SAN is less reliable).

    There is an active mailing list for HA clustering, too:

    http://clusterlabs.org/mailman/listinfo/users

    Fencing, exactly.

    What we do is create a small /shared/definitions (on gfs2) to host the VM XML definitions and then undefine the VMs from the nodes. This makes the servers disappear from non-cluster-aware tools like virsh/virt-manager. Pacemaker can still start the servers just fine and, with fencing, makes sure that a server is only ever running on one node at a time.
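
    As a rough sketch (the guest name vm-A and the resource options below are examples, not a prescription), the move looks something like this:

      # save the definition onto the shared gfs2 mount, then remove it from libvirt
      virsh dumpxml vm-A > /shared/definitions/vm-A.xml
      virsh undefine vm-A

      # hand the guest to pacemaker as a VirtualDomain resource that reads that XML
      pcs resource create vm-A ocf:heartbeat:VirtualDomain \
          config=/shared/definitions/vm-A.xml \
          hypervisor="qemu:///system" \
          migration_transport=ssh \
          meta allow-migrate=true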

    We also have an active freenode IRC channel; #clusterlabs. Stop on by and say hello. :)

  • You can use oVirt for that (www.ovirt.org). For that small number of hosts, you would probably want to use the “hosted engine” architecture to co-locate the management engine on the same hypervisor hosts. It is provided by the CentOS Virtualization SIG, so on CentOS it is just a couple of ‘yum install’s away…
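
    Roughly speaking (the release package name below is a guess at the right version — check the current Virt SIG / oVirt docs):

      yum install -y centos-release-ovirt42        # CentOS Virt SIG repo package (version may differ)
      yum install -y ovirt-hosted-engine-setup     # hosted-engine deployment tooling
      hosted-engine --deploy                       # interactive hosted-engine setup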

    HTH,

  • When an unclean shutdown happens on node1, or someone runs ifdown eth0 there, can oVirt migrate the VMs from node1 to node2?

    In that case, is power management such as iLO needed?

  • Hi Digimer,

    Thanks for your reply.

    In what way is the SAN method less reliable? Am I going to get into a world of trouble going that way?

    I’ve had a brief look at the web-site. Lots of good info there. Thanks!

    That sounds simple enough :-P. Although, I wanted to be able to easily open VM consoles, which I currently do through virt-manager. I also use virsh for all kinds of ad-hoc management. Is there an easy way to still have my cake and eat it? We also have a number of Windows VMs. Remote Desktop is great but sometimes you just have to have a console.

    Will do. I have a bit of reading to catch up on now, but I’m sure I’ll have a few more questions before long.

    Kind regards, Tom

    In the HA space, there should be no single point of failure. A SAN, for all of its redundancies, is a single thing. Google for tales of bad SAN firmware upgrades to get an idea of what I am talking about.

    We’ve found that using DRBD and building clusters in pairs is a far more resilient design. First, you don’t put all your eggs in one basket, as it were. So if you have multiple failures and lose a cluster, you lose one pair and the servers it was hosting. Very bad, but less so than losing the storage for all your systems.

    Consider this case that happened a couple of years ago;

    We had a client, with no malicious intent but a misunderstanding of how “hot swap” worked, walk up to a machine and start ejecting drives. We got in touch and stopped him in very short order, but the damage was done and the node’s array was hosed. Certainly not a failure scenario we had ever considered.

    DRBD (which is sort of like “RAID 1 over a network”) simply marked the local storage as Diskless and routed all reads/writes to the good peer. The hosted VM servers (and the software underpinning them) kept working just fine. We lost the ability to live migrate because we couldn’t read from the local disk anymore, but the client continued to operate for about an hour until we could schedule a controlled reboot to move the servers without interrupting production.
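
    For anyone who hasn’t used DRBD, a minimal resource definition looks something like this (the hostnames, backing devices and addresses are made up):

      # /etc/drbd.d/r0.res
      resource r0 {
          net       { protocol C; }   # fully synchronous replication
          device    /dev/drbd0;       # the replicated block device the cluster uses
          disk      /dev/sdb1;        # local backing disk on each node
          meta-disk internal;
          on node1 { address 10.0.0.1:7788; }
          on node2 { address 10.0.0.2:7788; }
      }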

    In short, to do HA right, you have to be able to look at every piece of your stack and say “what happens if this goes away?” and design it so that the answer is “we’ll recover”.

    For clients who need the performance of SANs (go big enough and the caching and whatnot of a SAN is superior), we recommend two SANs, connect one to each node, and treat them from there up as DAS.

    Clusterlabs is now the umbrella for a collection of different open source HA projects, so it will continue to grow as time goes on.

    We use virt-manager, too. It’s just fine. Virsh also works just fine. The only real difference is that once a server shuts off, it “vanishes” from those tools. I would say 75%+ of our clients run some flavour of Windows on our systems and both access and performance are just fine.
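
    Concretely, the difference is just this (assuming the guest is stopped and managed by pacemaker):

      virsh list --all        # an undefined, stopped guest no longer shows up here
      pcs status resources    # ...but pacemaker still lists its VirtualDomain resource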

    Happy to help. If you stop by and don’t get a reply, please idle. Folks there span all timezones but are generally good about reading scroll-back and replying.

  • I can’t speak to oVirt (it’s more of a cloud platform than an HA one), but in HA in general, this is how it works…

    Say node1 is hosting vm-A. Node1 stops responding for some reason (maybe it’s hung, maybe it’s running but lost its network, maybe it’s a pile of flaming rubble, you don’t know). Within a moment, the other cluster node(s) will declare it lost and initiate fencing.

    Typically, “fencing” means “shut the target off over IPMI” (iRMC, iLO, DRAC, RSA, etc.). However, let’s assume that the node lost all power (we’ve seen this with voltage regulators failing on the mainboard, shorted cable harnesses, etc.). In that case, the IPMI BMC will fail as well, so this method of fencing will fail.

    The cluster can’t assume that “no response from fence device A == dead node”. All you know is that you still don’t know what state the peer is in. To make an assumption and boot vm-A now would be potentially disastrous. So instead, the cluster blocks and retries the fencing indefinitely, leaving things hung. The logic being that, as bad as it is to hang, it is better than risking a split-brain/corruption.

    What we do to mitigate this, and pacemaker supports this just fine, is add a second layer of fencing. We do this with a pair of switched PDUs. So say that node1’s first PSU is plugged into PDU 1, Outlet 1 and its second PSU is plugged into PDU 2, Outlet 1. Now, instead of blocking after IPMI fails, the cluster moves on and turns off the power to those two outlets. Being that the PDUs are totally external, they should be up. So in this case, now we can say “yes, node1 is gone” and safely boot vm-A on node2.
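
    In pacemaker terms, that layering is done with stonith devices plus fence levels — roughly like this for node1 (addresses, credentials and outlet numbers are placeholders):

      # level 1: IPMI/iLO/DRAC fencing
      pcs stonith create fence-node1-ipmi fence_ipmilan \
          pcmk_host_list="node1" ipaddr="10.20.0.1" \
          login="admin" passwd="secret" lanplus=1

      # level 2: cut both PDU outlets feeding node1's two PSUs
      pcs stonith create fence-node1-pdu1 fence_apc_snmp \
          pcmk_host_list="node1" ipaddr="10.20.1.1" port="1"
      pcs stonith create fence-node1-pdu2 fence_apc_snmp \
          pcmk_host_list="node1" ipaddr="10.20.1.2" port="1"

      pcs stonith level add 1 node1 fence-node1-ipmi
      pcs stonith level add 2 node1 fence-node1-pdu1,fence-node1-pdu2

    Level 1 is tried first; only if it fails does the cluster move on to level 2.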

    Make sense?

  • Yep.

    It needs a way to ensure the host is down to prevent storage corruption, so yeah.

    If you have some other way to determine that all the VMs on the failed host are down, you can use the API/CLI to tell it to consider the host as if it had already been fenced and bring the VMs back up on the 2nd host.

  • In addition to power fencing as described by others, you can also fence at the Ethernet switch layer, where you disable the switch port(s) that the dead host is on. This of course requires managed switches that your cluster management software can talk to. If you’re using dedicated networking for iSCSI (often done for high performance), you can just disable that port.
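
    As a sketch, pacemaker can drive that with an SNMP fence agent such as fence_ifmib (the switch address, community string and interface name are placeholders):

      pcs stonith create fence-node1-switch fence_ifmib \
          pcmk_host_list="node1" ipaddr="10.20.2.1" \
          community="private" port="Gi1/0/10"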

  • This is called “fabric fencing” and was originally the only supported option in the very early days of HA. It has fallen out of favour for several reasons, but it does still work fine. The main issue is that it leaves the node in an unclean state. If an admin (out of ignorance or panic) reconnects the node, all hell can break loose. So generally power cycling is much safer.

  • Once upon a time, John R Pierce said:

    On boot, the cluster software assumes it is “wrong” and doesn’t connect to any resources until it can verify state.

    If the node is just disconnected and left running, and later reconnected, it can try to write out (now old/incorrect) data to the storage, corrupting things.

    Speaking of shared storage, another fencing option is SCSI reservations. They can be terribly finicky, but they can be useful.
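
    For reference, that is typically set up with the fence_scsi agent plus unfencing — something like this (the multipath device below is a placeholder):

      pcs stonith create fence-scsi fence_scsi \
          pcmk_host_list="node1 node2" \
          devices="/dev/mapper/mpatha" \
          meta provides=unfencing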

  • Close.

    The cluster software and any hosted services aren’t running. It’s not that they think they’re wrong, they just have no existing state so they won’t try to touch anything without first ensuring it is safe to do so.

    SCSI reservations, and anything that blocks access is technically OK. However, I stand by the recommendation to power cycle lost nodes. It’s by far the safest (and easiest) approach. I know this goes against the grain of sysadmins to yank power, but in an HA setup, nodes should be disposable and replaceable. The nodes are not important, the hosted services are.

  • Once upon a time, Digimer said:

    Well, I was being short; what I meant was, in HA, if you aren’t known to be right, you are wrong, and you do nothing.

    One advantage SCSI reservations have is that if you can access the storage, you can lock out everybody else. It doesn’t require access to a switch, management card, etc. (which may have their own problems). If you can access the storage, you own it; if you can’t, you don’t. Putting a lock directly on the actual shared resource can be the safest path (if you can’t access it, you can’t screw it up).

    I agree that rebooting a failed node is also good, just pointing out that putting the lock directly on the shared resource is also good.

  • Ah, yes, exactly right.

    The SCSI reservation protects shared storage only, which is my main concern. A lot of folks think that fencing is only needed for storage, when it is needed for all HA’ed services. If you know what you’re doing though, particularly if you combine it with watchdog-based fencing like fence_sanlock, you can be in good shape (if very, very slow fencing times are OK for you).

    In the end though, I personally always use IPMI as the primary fence method with a pair of switched PDUs as my backup method. Brutal, simple and highly effective. :P

  • Of course, the really tricky problem is implementing an iSCSI storage infrastructure that’s fully redundant and has no single point of failure. This requires the redundant storage controllers to have a shared write-back cache, fully redundant networking, etc. The Fibre Channel SAN folks had all this down pat 20 years ago, but at an astronomical price point.

    The more complex this stuff gets, the more points of potential failure you introduce.

  • Digimer wrote:

    Question: when y’all are saying “reconnect”, is this different from stopping the HA services, reconnecting to the network, and then starting the services (which would let you avoid a reboot)?

    mark

  • Once upon a time, John R Pierce said:

    Yep. I inherited a home-brew iSCSI SAN with two CentOS servers and a Dell MD3000 SAS storage array. The servers run as a cluster, but you don’t get the benefits of specialized hardware.

    We also have installed a few Dell EqualLogic iSCSI SANs. They do have the specialized hardware, like battery-backed caches and special interconnects between the controllers. They run active/standby, so they can do “port mirroring” between controllers (where if port 1 on the active controller loses link, but port 1 on the standby controller is still up, the active controller can keep talking through the standby controller’s port).

    I like the “build it yourself” approach for lots of things (sometimes too many :) ), but IMHO you just can’t reach the same level of HA and performance as a dedicated SAN.

  • Or use DRBD. That’s what we do for our shared storage backing our VMs and shared FS. Works like a charm.

  • Expecting a lost node to behave in any predictable manner is not allowed in HA. In theory, with fabric fencing, that is exactly how you could recover (stop all HA software, reconnect, start), but even then a reboot is highly recommended before reconnecting.

  • “(The node may become synonymous with the service if there’s no redundancy, but that’s a bug, not a feature.)”

    I am so stealing the hell out of that line. <3

  • How about trying commercial RHEV?

    Eero
    22.6.2016 8.02 a.m., “Tom Robinson” wrote: