NSX Controller Cluster Failure: What to expect?
In this blog post I will detail what to expect when a controller cluster loses nodes, specifically when you lose:
- 1 controller cluster node.
- 2 controller cluster nodes.
- 3 controller cluster nodes.
It is highly unlikely that you will lose a controller cluster because of a fault in the cluster itself. You are more likely to lose it because of vSphere-related issues, since NSX ultimately runs on top of vSphere. By vSphere issues I mean compute, storage, or networking problems, or a badly handled virtual machine operation.
Before I start, I would like to point you to the following VMware Docs:
- NSX Controller Cluster Failures.
- Recover from an NSX Controller Failure.
- Understanding the NSX Controller Cluster Architecture.
Here we go:
- Losing a single controller cluster node:
  - No effect on the controller cluster; it keeps functioning as it should.
- Losing two controller cluster nodes:
  - The cluster loses its quorum, because quorum is based on node majority and we no longer have that.
  - The cluster becomes read-only and no updates are written to the controllers.
  - ARP suppression is still active and you won’t see any ARP requests on VMs that are part of different logical switches.
  - You will not be able to create any logical switches (you will get an error if you try).
  - If you are using dynamic routing and you have a DLR control VM, you will lose incoming routing updates, as the controller cluster will not be able to update the routing table on the ESXi hosts.
  - Existing routes remain cached; to my knowledge there is no timeout value or cache drain on the ESXi hosts.
  - All existing virtual machines are still able to communicate E/W and N/S.
  - New virtual machines are able to join a VNI (remember that an LS is a port group) and are reachable (I didn’t notice any abnormal behaviour here).
- Losing three controller cluster nodes:
  - Everything that applies when losing two controller cluster nodes applies here as well, except for ARP suppression: it is lost too, and you will notice ARP requests on VMs that reside on different logical switches.
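To make the majority rule behind these scenarios concrete, here is a minimal Python sketch. It is not an NSX API and does not talk to any controller. The names `quorum_size`, `has_quorum`, and `observed_effects` are hypothetical helpers, and the returned values simply encode the behaviour listed above for a three-node cluster.

```python
CLUSTER_SIZE = 3  # the three-node controller cluster discussed in this post


def quorum_size(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(total_nodes: int, failed_nodes: int) -> bool:
    """True if the surviving nodes still form a majority."""
    if not 0 <= failed_nodes <= total_nodes:
        raise ValueError("failed_nodes must be between 0 and total_nodes")
    return total_nodes - failed_nodes >= quorum_size(total_nodes)


def observed_effects(failed_nodes: int) -> dict:
    """Behaviour listed in this post for a given number of failed nodes.

    Hypothetical helper: it only restates the list above, it is not
    derived from any NSX documentation or API.
    """
    quorum = has_quorum(CLUSTER_SIZE, failed_nodes)
    return {
        "quorum": quorum,
        "can_create_logical_switches": quorum,
        "receives_new_dynamic_routes": quorum,
        # ARP suppression only stopped once all three nodes were gone.
        "arp_suppression": failed_nodes < CLUSTER_SIZE,
        # Existing E/W and N/S traffic kept flowing in every scenario.
        "existing_vm_traffic": True,
    }


if __name__ == "__main__":
    for failed in range(1, CLUSTER_SIZE + 1):
        print(failed, observed_effects(failed))
```

With three nodes the majority is two, which is why losing one node changes nothing while losing two makes the cluster read-only.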
The results above are based on tests that I ran in my lab (NSX-v 6.3.5 and ESXi 6.5 Update 1). If you have any additional information, let me know and I will update this blog post accordingly.
Again, this is pretty rare. If a controller cluster node is completely lost, you should most likely redeploy a new one; otherwise, open a support ticket with VMware GSS and let them handle it from there.
I hope this has been informative, and thank you for your time.