NSX Controller Cluster Failure: What to expect?
In this blog post I will detail what to expect when you lose controller cluster nodes, in terms of losing:
- 1 controller cluster node.
- 2 controller cluster nodes.
- 3 controller cluster nodes.
It is highly unlikely that you will lose the controller cluster because of a fault in the cluster itself; it is far more likely to happen because of vSphere-related issues, since vSphere is ultimately the service layer NSX sits on. By vSphere issues I mean compute, storage and networking problems, or badly handled virtual machine operations.
Before I start I would like to shed some light on the following VMware Docs:
- NSX Controller Cluster Failures.
- Recover from an NSX Controller Failure.
- Understanding the NSX Controller Cluster Architecture.
Here we go:
- Losing a single controller cluster node:
- No effect here: the controller cluster keeps its quorum and will still be functioning as it should (the status-check sketch after this list shows one way to confirm that).
- Losing two controller cluster nodes:
- Here the cluster loses its quorum: quorum is based on node majority, and with only one of three nodes left there is no majority.
- The cluster now becomes read-only and no further updates are written to the controllers.
- ARP suppression will still be active, so you will not notice ARP requests on VMs that sit on the different logical switches.
- You will not be able to create any logical switches; you will get an error if you try (see the API sketch after this list).
- If you are using dynamic routing and you have a DLR control VM, you will lose incoming routing updates, as the controller cluster will not be able to update the routing tables on the ESXi hosts.
- Existing routes will remain cached, and to my knowledge there is no timeout value on the ESXi hosts or any cache drainage for that matter.
- All existing virtual machines will still be able to communicate E/W and N/S.
- If you create new virtual machines they will still be able to join a VNI (remember that a logical switch is, in the end, a port group) and they will be reachable; I did not notice any abnormal behaviour here, but look at the comment from Nelson Reyes below.
- Losing three controller cluster nodes:
- Everything that applies when losing two controller cluster nodes applies here as well, except that ARP suppression is lost too, and you will notice ARP requests on VMs residing on the different logical switches.
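As a side note: an easy way to see how NSX Manager views the controller nodes at each of these stages is the controllers REST API. The snippet below is a minimal Python sketch, not an official procedure; the manager address and credentials are placeholders, and it assumes the NSX-v 6.x endpoint /api/2.0/vdn/controller, which returns each controller with its status.

# Minimal sketch: list NSX controllers and their status via the NSX Manager API.
# The manager address and credentials below are placeholders for your own environment.
import requests

NSX_MANAGER = "https://nsx-manager.lab.local"   # placeholder
AUTH = ("admin", "changeme")                    # placeholder credentials

resp = requests.get(
    f"{NSX_MANAGER}/api/2.0/vdn/controller",
    auth=AUTH,
    verify=False,  # lab only; use proper certificate validation in production
)
resp.raise_for_status()
# The response is XML; each <controller> element includes a <status> such as RUNNING.
print(resp.text)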
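And if you want to see the read-only behaviour for yourself, try creating a logical switch through the API while quorum is lost. Again a minimal Python sketch under the same assumptions (placeholder manager address, credentials and transport zone ID, and the standard virtualwires endpoint); with only one controller left you should get an error back instead of a new virtual wire ID.

# Minimal sketch: attempt to create a logical switch (virtual wire) via the NSX Manager API.
# Manager address, credentials and transport zone (scope) ID are placeholders.
import requests

NSX_MANAGER = "https://nsx-manager.lab.local"   # placeholder
AUTH = ("admin", "changeme")                    # placeholder credentials
SCOPE_ID = "vdnscope-1"                         # placeholder transport zone ID

body = """<virtualWireCreateSpec>
  <name>test-ls-no-quorum</name>
  <tenantId>lab</tenantId>
  <controlPlaneMode>UNICAST_MODE</controlPlaneMode>
</virtualWireCreateSpec>"""

resp = requests.post(
    f"{NSX_MANAGER}/api/2.0/vdn/scopes/{SCOPE_ID}/virtualwires",
    data=body,
    headers={"Content-Type": "application/xml"},
    auth=AUTH,
    verify=False,  # lab only
)
# With the cluster in read-only mode, expect an error status rather than a new virtualwire ID.
print(resp.status_code, resp.text)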
The above results are based on tests I ran in my lab (NSX-v 6.3.5 and ESXi 6.5 Update 1); if you have any additional information to add, I will update this blog post accordingly.
Again, this is pretty rare. If a controller cluster node is totally lost, you should most probably redeploy a new one; otherwise, open a support ticket with VMware GSS and let them handle it from there.
I hope this has been informative and thank you for your time,
(Abdullah)^2
Hello there,
I think that step 8 in this post needs a small consideration: if the host where you want to connect a new VM already has a member on that VNI, the VM will receive traffic without any problem, but if the host is not part of that VNI, then the VM can't communicate with anyone.
Hello Nelson,
Agreed, thank you for the heads-up. I've updated the blog post to point to this comment.
(Abdullah)^2
Hello there,
Does this list account for CDO operation? There are plenty of sources that describe CDO, but I haven't found one that mentions what happens if I have lost two controller cluster nodes, thus one controller in this read-only mode, and at the same time have CDO enabled (assuming a single-site cluster, no cross-VC NSX). Does CDO kick in? Does it not? If it does, when does it deactivate: when I again have at least two active controllers, or only when the full cluster is OK again? If it does not kick in, should I kill the remaining controller so that CDO kicks in before I hurry to restore the controller cluster? Or should I forget about CDO under these circumstances and just hurry up with the controller restoration?
Thank you.
Martin
Hello,
CDO is another protection mechanism, but you should definitely work on getting the controller cluster back up and running as soon as possible.