Migrating from a VMware NSX Collapsed Cluster to Dispersed Clusters

This is not something one would do on a daily basis, but out in the field you sometimes run into situations where the customer wants to proceed with the implementation, yet procurement issues prevent you from implementing the design that was agreed upon, in this case a VVD-oriented design.

In this case, the implementation needed to go live while procurement of the remaining hardware was going to take more time, so it was decided to go with a collapsed NSX cluster (management, edge, and compute) on a single ESXi host cluster sharing the same vDS.

Since we were expecting the new hardware at a later stage, this implied a migration and a separation of the roles into:

  1. A separate management cluster.
  2. A separate compute cluster (most likely the current collapsed cluster).
  3. A separate edge cluster.

In production, the first things you think of are downtime, maintenance windows, and the risks and implications of such a major migration, especially with all the bits and pieces that NSX introduces. I took this task to my lab, simulated the procedure, and logged my results; I hope the following gives you a clearer picture of how this can be done.

    1. Management Cluster:
      1. The management hosts at this point should be up and running and production-ready.
      2. vCenter Server:
        1. Option 1: Create a new vCenter Server with an embedded PSC (this will have its own domain and site) [on one of the management hosts].
        2. Option 2: Create a new vCenter Server and join it to the existing PSC domain and site [on one of the management hosts].
      3. Create a new cluster for management hosts.
      4. Create the vDS and migrate the management hosts to it.
      5. Configure storage.
      6. Configure HA.
      7. Migrate the existing management virtual machines from the collapsed cluster to the new management cluster:
        1. If you chose vCenter Server Option 1:
          1. Migrate via a transitory (jump) host: add the jump host, remove the management VMs from the inventory and re-register them on it, then detach the host and add it to the new cluster.
        2. If you chose vCenter Server Option 2:
          1. You can simply vMotion those virtual machines to the new management cluster.
          2. Make sure to cater for EVC and have both clusters at the same EVC level so that the migration can be done online (see the EVC check sketch after this list).
    2. Compute Cluster:
      1. Remains as is; the workload VMs stay on their current hosts.
    3. Edge Cluster:
      1. Create a new cluster within the Compute vCenter server.
      2. Add the Edge ESXi hosts to it.
      3. Create a vDS for the edge cluster and configure the necessary port groups, including the ESG uplinks.
      4. Prepare the new cluster for VXLAN (agent installation, VXLAN configuration, and transport zone membership); see the host preparation sketch after this list.
      5. Controllers:
        1. Move the controllers to the new edge cluster by removing them one at a time and re-deploying each into the edge cluster (see the controller sketch after this list).
      6. ESGs (Dynamic Routing with ECMP):
        1. Create a new ESG (using the new FC uplinks) and peer it with the DLR.
        2. Peer the new ESG(s) with the border/transit leaf switches.
        3. Verify the routing table(s) (see the central CLI sketch after this list).
        4. Drain the routes coming through the old ESGs (by weight for BGP or by cost for OSPF; see the BGP drain sketch after this list).
        5. Allow a period of one to two days for the network/applications to update their routing tables, and verify the routing table on a sample of virtual machines.
        6. Remove the DLR peering with the old ESG(s) [at this point there should not be any disconnections unless a VM still has its routes going through the old set of ESGs].
        7. Verify the routing table(s).
        8. Remove the old ESG(s).
      7. ESGs (Dynamic Routing / Static Routing with HA + Services):
        1. For these appliances, HA will be enabled.
        2. Remove the passive node from the collapsed cluster.
        3. Redeploy it in the edge cluster.
        4. Make the new (passive) node active. There will be downtime here with no way around it, especially for the services, so cater for it: you can take the HA dead timer down to 6 seconds (the absolute minimum) for the failover, then set it back to 9 seconds (the preferred minimum) afterwards (see the dead-timer sketch after this list).
      8. DLR Control VM:
        1. Static routes:
          1. These static routes are created to protect against black-holing traffic during the DLR control VM migration (it is safe to keep them in production, but someone will then need to add an entry for every new logical switch; alternatively, use a summarized route on the ESGs).
          2. Create a set of static routes on the DLR towards the ESGs, or a default route 0.0.0.0/0 towards the ESGs, with a higher admin distance (see the static-route sketch after this list).
          3. Create a set of static routes on the ESGs towards the DLR, covering the logical switch subnets, with a higher admin distance.
        2. The DLR is presumed to be deployed in HA mode.
        3. Relocate the DLR’s passive control VM to the edge cluster (it will be re-created there).
        4. Change the admin state of the active DLR control VM (in the compute cluster) to down, with the dead timer set to 9 seconds (the recommended minimum) [at this point there should not be any disconnections, as once the control VM is no longer peering, the static routes take over].
        5. Verify that the passive node on the edge cluster is now active and that all VXLAN communication has been restored.
        6. Redeploy the passive appliance remaining on the collapsed cluster to the new edge cluster.
      9. Re-validate connectivity end-to-end.
      10. Revisit the DRS rules and make the necessary changes (see the DRS rule sketch below).
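
A few sketches for the steps above follow. First, the EVC check from the management cluster section: a minimal pyVmomi sketch that compares the EVC modes of the two clusters before attempting the online vMotion. The vCenter address, credentials, and cluster names are lab placeholders.

```python
# Minimal sketch: compare EVC modes before a cross-cluster vMotion.
# Host, credentials, and cluster names are lab placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; validate certs in production
si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)

def find_cluster(name):
    """Return the ClusterComputeResource with the given name, or None."""
    view = si.content.viewManager.CreateContainerView(
        si.content.rootFolder, [vim.ClusterComputeResource], True)
    try:
        return next((c for c in view.view if c.name == name), None)
    finally:
        view.DestroyView()

old = find_cluster("Collapsed-Cluster")    # placeholder names
new = find_cluster("Management-Cluster")

# currentEVCModeKey is None when EVC is disabled on a cluster.
print("collapsed:", old.summary.currentEVCModeKey,
      "| management:", new.summary.currentEVCModeKey)
if old.summary.currentEVCModeKey != new.summary.currentEVCModeKey:
    print("EVC modes differ -- an online vMotion between the clusters may fail.")

Disconnect(si)
```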
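
For the VXLAN preparation of the new edge cluster, a hedged sketch against the NSX-v 6.x REST API: verify that the host preparation status is green, then expand the existing transport zone to cover the new cluster. The Manager address, credentials, cluster MoRef, and scope ID are placeholders.

```python
# Hedged sketch against the NSX-v 6.x REST API: verify host preparation on
# the new edge cluster, then expand the existing transport zone to cover it.
import requests

NSX = "https://nsxmgr.lab.local"
CLUSTER_MOID = "domain-c42"  # vCenter MoRef of the new edge cluster (placeholder)
SCOPE_ID = "vdnscope-1"      # the existing transport zone (placeholder)

s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only; validate the Manager certificate in production

# 1. Check that the agents/VIBs are installed and green on the new cluster.
r = s.get(f"{NSX}/api/2.0/nwfabric/status", params={"resource": CLUSTER_MOID})
r.raise_for_status()
print(r.text)  # XML: look for a GREEN status on each fabric feature

# 2. Expand the transport zone so the edge cluster joins the VXLAN scope.
body = f"""<vdnScope>
  <objectId>{SCOPE_ID}</objectId>
  <clusters>
    <cluster>
      <cluster>
        <objectId>{CLUSTER_MOID}</objectId>
      </cluster>
    </cluster>
  </clusters>
</vdnScope>"""
r = s.post(f"{NSX}/api/2.0/vdn/scopes/{SCOPE_ID}/attributes/expand",
           data=body, headers={"Content-Type": "application/xml"})
r.raise_for_status()
```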
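
For the controller move, the same API exposes the list, delete, and deploy calls. The controllerSpec values below (IP pool, resource pool, datastore, port group, password) are purely illustrative and must point at the edge cluster's resources; a real run would poll controller and cluster health between steps rather than sleep.

```python
# Rough sketch: move the NSX-v controllers one at a time. Endpoints are the
# documented 6.x ones; every ID below is a placeholder.
import time
import requests

NSX = "https://nsxmgr.lab.local"
s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only

def controllers():
    r = s.get(f"{NSX}/api/2.0/vdn/controller")
    r.raise_for_status()
    return r.text  # XML list of controllers with their status

print(controllers())

# Remove one controller from the collapsed cluster...
r = s.delete(f"{NSX}/api/2.0/vdn/controller/controller-1")
r.raise_for_status()

# ...then redeploy it into the new edge cluster. These controllerSpec values
# are purely illustrative; point them at the edge cluster's IP pool, cluster
# or resource pool MoRef, datastore, and management port group.
spec = """<controllerSpec>
  <name>controller-edge-1</name>
  <ipPoolId>ipaddresspool-1</ipPoolId>
  <resourcePoolId>domain-c42</resourcePoolId>
  <datastoreId>datastore-50</datastoreId>
  <networkId>dvportgroup-100</networkId>
  <password>Sup3rSecret!</password>
</controllerSpec>"""
r = s.post(f"{NSX}/api/2.0/vdn/controller", data=spec,
           headers={"Content-Type": "application/xml"})
r.raise_for_status()

time.sleep(300)  # crude stand-in for polling the deployment status
print(controllers())
```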
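
For the routing table verification steps, the NSX-v central CLI API (available from 6.2 onwards) lets you pull an edge's routing table without logging in to the appliance; edge-5 is a placeholder edge ID.

```python
# Sketch: pull an edge's routing table through the NSX-v central CLI API.
import requests

NSX = "https://nsxmgr.lab.local"
s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only

body = "<nsxcli><command>show edge edge-5 ip route</command></nsxcli>"
r = s.post(f"{NSX}/api/1.0/nsx/cli", params={"action": "execute"},
           data=body, headers={"Content-Type": "application/xml",
                               "Accept": "text/plain"})
r.raise_for_status()
print(r.text)  # the edge's routing table, as the CLI would print it
```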
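
For the BGP drain, one programmatic option is lowering the neighbour weight on the DLR for the peers that point at the old ESGs (the NSX-v default neighbour weight is 60). The DLR edge ID and peer IPs are placeholders; with OSPF you would raise the cost instead.

```python
# Sketch: drain traffic away from the old ESGs by lowering the BGP neighbour
# weight on the DLR. All IDs and IPs are placeholders.
import xml.etree.ElementTree as ET
import requests

NSX = "https://nsxmgr.lab.local"
DLR_ID = "edge-5"                                  # placeholder DLR edge ID
OLD_ESG_PEERS = {"192.168.10.1", "192.168.10.2"}   # old ESG uplink IPs

s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only

r = s.get(f"{NSX}/api/4.0/edges/{DLR_ID}/routing/config/bgp")
r.raise_for_status()
bgp = ET.fromstring(r.text)

# Lower the weight on the neighbours pointing at the old ESGs so the paths
# through the new ESGs are preferred.
for nbr in bgp.iter("bgpNeighbour"):
    if nbr.findtext("ipAddress") in OLD_ESG_PEERS:
        w = nbr.find("weight")
        if w is None:                 # element may be absent; add it if so
            w = ET.SubElement(nbr, "weight")
        w.text = "10"

r = s.put(f"{NSX}/api/4.0/edges/{DLR_ID}/routing/config/bgp",
          data=ET.tostring(bgp), headers={"Content-Type": "application/xml"})
r.raise_for_status()
```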
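
For the HA dead-timer change on the services ESGs, the high-availability config can be fetched, edited, and put back; edge-10 is a placeholder for the HA-enabled services ESG.

```python
# Sketch: drop the ESG HA declare-dead timer before the failover and restore
# it afterwards via the NSX-v high-availability config.
import xml.etree.ElementTree as ET
import requests

NSX = "https://nsxmgr.lab.local"
ESG_ID = "edge-10"  # placeholder

s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only

def set_dead_time(seconds):
    url = f"{NSX}/api/4.0/edges/{ESG_ID}/highavailability/config"
    r = s.get(url)
    r.raise_for_status()
    ha = ET.fromstring(r.text)
    ha.find("declareDeadTime").text = str(seconds)  # assumed present per 6.x docs
    r = s.put(url, data=ET.tostring(ha),
              headers={"Content-Type": "application/xml"})
    r.raise_for_status()

set_dead_time(6)   # the absolute minimum, for the shortest failover window
# ... perform the failover / redeploy here ...
set_dead_time(9)   # back to the preferred steady-state minimum
```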
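
For the black-hole protection routes, a high-admin-distance default route on the DLR towards the ESGs can be pushed through the static routing config, so it only kicks in when the dynamic routes vanish with the control VM. The IPs and edge ID are placeholders; the mirror-image routes on the ESGs towards the DLR are pushed the same way.

```python
# Sketch: push the black-hole protection default route on the DLR.
# NOTE: PUT replaces the whole static routing config -- fetch, modify, and
# put back if you already carry static routes on this edge.
import requests

NSX = "https://nsxmgr.lab.local"
DLR_ID = "edge-5"  # placeholder

s = requests.Session()
s.auth = ("admin", "VMware1!")
s.verify = False  # lab only

body = """<staticRouting>
  <staticRoutes/>
  <defaultRoute>
    <vnic>0</vnic>
    <gatewayAddress>192.168.10.1</gatewayAddress>
    <adminDistance>240</adminDistance>
  </defaultRoute>
</staticRouting>"""
r = s.put(f"{NSX}/api/4.0/edges/{DLR_ID}/routing/config/static",
          data=body, headers={"Content-Type": "application/xml"})
r.raise_for_status()
```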
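
Finally, for revisiting the DRS rules, a quick pyVmomi listing of the rules on the collapsed cluster makes a handy before/after checklist; the cluster name and credentials are placeholders.

```python
# Sketch: list the DRS rules on the collapsed cluster so they can be
# recreated or adjusted after the split.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only
si = SmartConnect(host="vcsa.lab.local", user="administrator@vsphere.local",
                  pwd="VMware1!", sslContext=ctx)

view = si.content.viewManager.CreateContainerView(
    si.content.rootFolder, [vim.ClusterComputeResource], True)
for cluster in view.view:
    if cluster.name != "Collapsed-Cluster":      # placeholder cluster name
        continue
    for rule in cluster.configurationEx.rule:
        # affinity, anti-affinity, and VM/host rules all show up here
        print(f"{rule.name}: {type(rule).__name__}, enabled={rule.enabled}")
view.DestroyView()
Disconnect(si)
```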

I have done this successfully in a lab environment but have not yet tested it in production myself. Once that happens, I will update this blog post with the results and any modifications needed, as a lab is not ideal for getting that 100% certainty :).

One more thing: even though the procedure above does its best to get everything done with no or minimal downtime, you should schedule a maintenance window of at least 2 hours to give yourself a breadth of time to troubleshoot or roll back in case something major forces you to.

(Abdullah)^2

 
