Possibilities of VMware HA – VM Component Protection

While at VMworld 2012 I attended a session called “vSphere HA and Datastore Access Outages – Current – Capabilities Deep-Dive and Tech Preview”, also known as INF-BCO2807. The presentation was put together by Smriti Desai (@smriti_desai) and Keith Farkas, both from VMware. If you haven’t seen this session and have access to the recordings online, you really should take a look. It’s excellent.

The first part of the session went through current HA capabilities, including the overall HA architecture as well as the particular HA workflows and responses. The latter part of the presentation covered futures, namely VM Component Protection. VM Component Protection would/will change how ESXi deals with datastore outages, particularly in Permanent Device Loss (PDL) and All Paths Down (APD) scenarios.

Of course, this is a technical preview and none of this functionality is guaranteed to be implemented in future releases of vSphere.

Here is how VM component protection could work should it be released.

PDL Protection

When a datastore becomes inaccessible due to a PDL event (the target is responding, but the LUN or device isn’t), VM component protection will respond on a per-VM basis with one of two configured responses:

  • Take no action
  • Initiate HA failover

APD Protection

Just as with PDL protection, when a datastore becomes inaccessible due to APD (the target can’t be contacted on any path), VM component protection will respond on a per-VM basis:

  • Take no action
  • Initiate HA failover

APD protection goes through more steps than PDL protection before it applies the configured response:

  1. Once the datastore becomes inaccessible, APD is declared only after a specified timeout has been exceeded. In vSphere 5.1 the default APD timeout is 140 seconds, and it can be changed with the Misc.APDTimeout advanced setting
  2. Once APD is declared, you can configure an additional, optional delay
  3. Once the optional delay has been exceeded (if you set one), VM component protection moves on to determining the per-VM response
  4. If the response is “Take no action” then the workflow ends
  5. If the response is “Initiate HA failover” then VM component protection determines if it can reserve enough capacity to restart the VM (why shut down a VM if you can’t guarantee a restart because the cluster has insufficient resources?):
    • If capacity can be reserved, failover is initiated
    • If capacity cannot be reserved, it keeps trying to reserve that capacity until it succeeds or the APD condition is resolved
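
To make the order of those steps easier to follow, here is a minimal Python sketch of the APD workflow as I understood it from the session. It is a toy simulation, not VMware code: the names ApdPolicy, simulate_apd and can_reserve, the 10-second retry interval, and the outcome when the datastore comes back during the optional delay are all my own assumptions. The only value taken from a real product is the 140-second default for Misc.APDTimeout.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Response(Enum):
    NO_ACTION = "Take no action"
    FAILOVER = "Initiate HA failover"


@dataclass
class ApdPolicy:
    """Hypothetical per-VM settings; not a real vSphere object."""
    apd_timeout_s: int = 140   # vSphere 5.1 default for Misc.APDTimeout
    extra_delay_s: int = 0     # optional delay after APD is declared
    response: Response = Response.FAILOVER


def simulate_apd(policy: ApdPolicy, outage_s: int,
                 can_reserve: Callable[[int], bool],
                 retry_s: int = 10) -> str:
    """Walk one VM through the workflow for an outage lasting outage_s seconds.

    can_reserve(t) stands in for the cluster capacity check at time t.
    retry_s is an assumed retry interval; the session did not give one.
    """
    # Step 1: APD is declared only once the timeout has been exceeded.
    if outage_s <= policy.apd_timeout_s:
        return "recovered before APD was declared"

    # Steps 2-3: wait out the optional delay (assumption: recovery during
    # the delay ends the workflow).
    t = policy.apd_timeout_s + policy.extra_delay_s
    if outage_s <= t:
        return "recovered during the optional delay"

    # Step 4: "Take no action" ends the workflow.
    if policy.response is Response.NO_ACTION:
        return "no action"

    # Step 5: keep trying to reserve capacity until it works
    # or the APD condition clears.
    while t < outage_s:
        if can_reserve(t):
            return f"failover initiated at {t}s"
        t += retry_s
    return "APD resolved before capacity could be reserved"
```

For example, simulate_apd(ApdPolicy(extra_delay_s=60), 600, lambda t: t >= 250) returns "failover initiated at 250s": APD is declared at 140 seconds, the optional delay runs until 200 seconds, and the capacity check keeps failing until 250 seconds.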

I’ve paraphrased the above workflows based on Smriti and Keith’s session, and again, VMware has not committed to these enhancements. Keith demoed this live during the session, and it’s pretty awesome. I truly hope VMware sees the benefit in this, ESPECIALLY in stretched cluster designs, and implements it in the next major release.
