Microsoft Failover Cluster: The Consequences of Nonidentical Nodes
I was recently assigned to a Windows Server 2012 R2 RDS farm project with RemoteFX. The farm is to be placed on top of a Windows Server 2012 R2 Hyper-V cluster.
Before creating the virtual machines we had two things to do:
- Add additional memory to the Hyper-V hosts.
- Add graphics cards to the nodes that will be hosting the RDS farm (the farm consists of 3 HP DL380 Gen8 hosts and 3 HP DL180 Gen8 hosts, and the graphics cards will be installed on the DL380 hosts).
We picked one of the DL180 hosts for the memory upgrade. We paused the host and drained its roles to migrate all of its virtual machines, then shut it down gracefully.
We installed the additional memory, powered on the host, and validated that all components were up and running, including all networks (LAN, heartbeat, and the attached virtual switch). All was good, so we resumed the host and migrated back all the VMs that had been running on it.
A couple of minutes later we noticed that 50% of the virtual machines had powered off, and it was all red and _magical_:
- Some of the machines wouldn’t power on.
- Other machines refused to migrate to different hosts.
- Some powered on, but rather than booting into the OS the VM showed an error on the boot screen.
To describe the situation a little bit further, I would say that I looked very much like this!
Bottom line: the CSVs (Cluster Shared Volumes) were going offline, and a couple of tests showed that they were doing so randomly. The event log did not tell us much, and the cluster log did not reveal anything either.
To stabilize the environment we evicted the node we had upgraded (based on the rule of thumb "what did we change?") and suddenly we regained cluster stability. So! What happened? Why did a memory upgrade trigger a cluster catastrophe? To my knowledge, a cluster node usually complains during a validation check (which is an awesome feature, by the way). The other puzzle: everything had been working without any issues before, so how did it come to this? If you think I have an answer to that, guess again!
So we decided to run a full cluster-wide comparison to see how the nodes differed and to identify why a memory upgrade caused all this mayhem. We already knew that failover cluster nodes are required to be identical to a certain extent.
We found that:
- Cluster Service Node5 at 6.3.9600.16469 with 52 Windows Updates
- Cluster Service Node2 at 6.3.9600.16469 with 49 Windows Updates
- Cluster Service Node1 at 6.3.9600.18523 with 269 Windows Updates
- Cluster Service Node3 at 6.3.9600.18523 with 247 Windows Updates
- Cluster Service Node4 at 6.3.9600.18523 with 248 Windows Updates
You’ll notice that nodes 5 and 2 are way behind in terms of Windows Updates. Those updates definitely touched the cluster service, among many other components, and this by itself is a major inconsistency. Again, why was the cluster operational before? The only explanation that comes to my mind is that the cluster lost it because of the hardware change, and the effect was something like a split brain combined with a loss of quorum.
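The comparison above is easy to automate once the per-node numbers are collected. Below is a minimal Python sketch that groups the nodes by cluster service build and flags the ones that are behind. The inventory is hand-copied from the list above, and the names `NODES`, `parse_version`, `group_by_version` and `outdated_nodes` are my own for illustration; how you collect the data from a live cluster is left out on purpose.

```python
from collections import defaultdict

# Inventory hand-copied from the list above (an assumption of this sketch):
# node name -> (cluster service version, number of installed Windows Updates).
NODES = {
    "Node5": ("6.3.9600.16469", 52),
    "Node2": ("6.3.9600.16469", 49),
    "Node1": ("6.3.9600.18523", 269),
    "Node3": ("6.3.9600.18523", 247),
    "Node4": ("6.3.9600.18523", 248),
}


def parse_version(version):
    """Turn '6.3.9600.18523' into (6, 3, 9600, 18523) so builds compare numerically."""
    return tuple(int(part) for part in version.split("."))


def group_by_version(nodes):
    """Return {version string: sorted list of node names running that version}."""
    groups = defaultdict(list)
    for name, (version, _update_count) in nodes.items():
        groups[version].append(name)
    return {version: sorted(names) for version, names in groups.items()}


def outdated_nodes(nodes):
    """Return the sorted names of nodes whose build is older than the newest one seen."""
    newest = max(parse_version(version) for version, _ in nodes.values())
    return sorted(
        name
        for name, (version, _) in nodes.items()
        if parse_version(version) < newest
    )


if __name__ == "__main__":
    groups = group_by_version(NODES)
    for version in sorted(groups, key=parse_version):
        print(version, groups[version])
    print("Behind the newest build:", outdated_nodes(NODES))
```

A check like this only compares the build string; the update counts are carried along in the inventory but a small difference there (247 vs. 248) does not necessarily mean a different cluster service build.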
To validate my theory, we:
- Re-installed Windows Server 2012 R2 from scratch.
- Applied the updates to match the patch level of nodes 1, 3 and 4.
- Rejoined the node to the cluster.
We then migrated a couple of workloads to the node and monitored it constantly for around an hour, and all was good and stable. After fixing the two nodes I disabled Windows Updates on all the nodes so that their patch levels could not drift apart again (I know it is not recommended, but sometimes it is the better option when an environment doesn’t have proper controls).
I have been working with Microsoft failover clusters for some time now, and this is the first time I have seen such instability on a cluster. I hope this blog helps you better understand the importance of keeping the nodes of a Microsoft Failover Cluster identical at all times.
Although this was not part of our project, I definitely learned to do full end-to-end checks on failover clusters before even looking at them %). Oh, and one more thing! Microsoft, PLEASE PLEASE change the name of the Failover Cluster utility to something else, as it doesn’t give you much hope when you open the Start menu and type ‘fail’ in order to get to the management console ;-).