Relocation of a Stretched Cluster

A happy and successful new year to all readers!

Last year I had two relocation projects, each with a VxRail Stretched Cluster. Since I had already written the concept, I decided to share it on my blog for everybody who may need some guidance.

Btw, you can also use this guide if you need to perform maintenance on one of the Stretched Cluster sites – for example in case of a network refresh or the yearly UPS test.

// Requirements

The main requirement during the relocation is that there is no downtime for the services, or rather the VMs, running on the VxRail Stretched Cluster.

For the relocation of the VxRail nodes, the technical requirements are the following:

The rack at the destination must have enough space and must be prepared with adequate PDUs, which are hopefully connected to a UPS.
The network connection and configuration for the relocated VxRail nodes must be the same at the new location as at the old location.
Necessary services for the VxRail environment (vCenter Server, vSAN Witness Appliance, DNS and NTP servers, etc.) must be available and healthy.
The ESXi Cluster and/or Host (incl. the data store and the network) where the vCenter Server and/or the vSAN Witness Appliance reside must be available and healthy.
The utilization of the VxRail cluster should not exceed 45% of all available resources in compute (CPU, RAM) before the relocation.

// Options

To meet the requirements, there is only one reasonable option for the relocation of the hardware. Depending on the default storage policy in use, it should be possible to shut down all nodes of one fault domain (in one data center) for the time needed to move the physical nodes to the new location.

All VMs running on the Stretched Cluster will then run only on the nodes of the remaining fault domain. Therefore, both fault domains need to have sufficient compute and storage capacities to be able to fulfil the resource requirements of all VMs. This is a prerequisite.
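The capacity prerequisite above comes down to simple arithmetic: with two equally sized sites, a cluster-wide utilization of 45% turns into roughly 90% on the one remaining site. The following Python snippet is a minimal sketch of that check. The `SiteCapacity` model, the example figures and the 90% limit are assumptions for illustration, not values read from a real cluster.

```python
from dataclasses import dataclass


@dataclass
class SiteCapacity:
    """Capacity and current demand of one fault domain (hypothetical model)."""
    name: str
    cpu_ghz_total: float
    cpu_ghz_used: float
    ram_gb_total: float
    ram_gb_used: float


def utilization_after_evacuation(remaining, evacuated):
    """Projected CPU/RAM utilization (0..1) of the remaining site once it
    also carries the demand of the evacuated site."""
    cpu = (remaining.cpu_ghz_used + evacuated.cpu_ghz_used) / remaining.cpu_ghz_total
    ram = (remaining.ram_gb_used + evacuated.ram_gb_used) / remaining.ram_gb_total
    return {"cpu": cpu, "ram": ram}


def can_absorb(remaining, evacuated, limit=0.90):
    """True if both projected values stay at or below the limit.
    The 90% default is an assumption derived from 2 x 45%."""
    projected = utilization_after_evacuation(remaining, evacuated)
    return all(value <= limit for value in projected.values())


if __name__ == "__main__":
    # Example figures are made up.
    stays = SiteCapacity("DC-A", 400.0, 160.0, 2048.0, 800.0)
    moves = SiteCapacity("DC-B", 400.0, 180.0, 2048.0, 900.0)
    print(utilization_after_evacuation(stays, moves), can_absorb(stays, moves))
```

Running the check in both directions (DC-A absorbing DC-B and vice versa) covers the case where the sites are not sized identically.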

To be able to restore regular operation as quickly as possible after the relocation of the VxRail nodes, the correct network configuration in the new data center is a prerequisite. All VLANs and ports currently used by the cluster must be preconfigured accordingly on the physical switches in the new location. All necessary services (like vCenter, Witness, DNS or NTP) must be reachable via these switches and their breakout connections. All relocated VxRail nodes must be connected to these switches using the same uplinks as before (for system traffic like ESXi management, vMotion, vSAN data and Witness traffic, for the VLANs of the virtual machines and for iDRAC).

To minimize the risk during relocation, it is highly recommended to use a VxRail version of at least 7.0.3xx in the environment (incl. vSAN version 7 U3). In this vSAN version there is an improvement regarding the behavior in case of the failure of one fault domain (1 DC or the Witness) shortly followed by a failure of a second fault domain (1 DC or the Witness):

“Stretched Cluster site/Witness failure resiliency. This release enables Stretched Clusters to tolerate planned or unplanned downtime of a site and the Witness. You can perform site-wide maintenance (such as power or networking) without concerns about Witness availability.”

https://docs.vmware.com/en/VMware-vSphere/7.0/rn/vmware-vsan-703-release-notes.html
https://www.yellow-bricks.com/2021/10/04/vsan-7-0-u3-enhanced-stretched-cluster-resiliency-what-is-it/
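If you want to check the version recommendation in a script, a small comparison helper is enough. This is only a sketch: it assumes the version is available as a dotted string such as "7.0.320" (optionally followed by a "-build" suffix) and it reads "7.0.3xx" as "7.0.300 or newer". Both the string format and the function names are my assumptions.

```python
def parse_version(version):
    """Turn '7.0.320' or '7.0.320-12345678' into the tuple (7, 0, 320).
    The '-build' suffix handling is an assumption about the string format."""
    core = version.split("-", 1)[0]
    return tuple(int(part) for part in core.split("."))


def meets_minimum(version, minimum="7.0.300"):
    """True if the version is at least the minimum (tuple comparison)."""
    return parse_version(version) >= parse_version(minimum)


if __name__ == "__main__":
    for candidate in ("7.0.240", "7.0.320", "8.0.100"):
        print(candidate, meets_minimum(candidate))
```

Comparing integer tuples avoids the classic trap of comparing version strings alphabetically.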

Relocation plan

The VxRail relocation can be planned as a complete lift and shift together with all the networking devices if they will still be used after the relocation.

Based on the technical design and the configuration of the VxRail Stretched Cluster, the relocation should have no impact on the virtual machines and their running applications or services. All VMs will be either migrated to the remaining fault domain or shut down during the relocation.

// Infrastructure

Coordination with the responsible infrastructure team is mandatory! The vCenter Server and the vSAN Witness Appliance which are part of the VxRail Stretched Cluster should be running without any failures or issues.

// Network

Coordination with the network team is mandatory!

The switches are not a component of the VxRail cluster, but they are essential for it! Before the relocation, it is highly recommended to save the switch configuration if the switches are relocated as well!

If relocated, the switches must be de-cabled and removed from the racks as the very last component, after the VxRail nodes and all other systems have been shut down in a controlled manner!

If relocated, the switches must also be the very first component to be racked and cabled. They must be running error free and be correctly configured in the new location before the VxRail nodes and all other systems can be powered on!

// General preparations

The following preparations should be done:

Jump Host
A jump host (or similar) with network connection to the management systems of the VxRail cluster and corresponding applications (browser, SSH client, etc.) to access the systems should be available.

Documentation
The latest documentation for all involved systems (VxRail, network, etc.) should be available. That includes a complete cabling plan for all involved systems and a rack layout plan.

System access
IP addresses and the corresponding administrative user accounts with passwords for access to all management systems should be available.

// Specific preparations

The following tasks should be done / scheduled to ensure the relocation is possible with no downtime of the production virtual services:

Health monitoring
The health of the external vCenter Server, the VxRail cluster and the vSAN Witness Appliance should be monitored on a regular basis.

Resource monitoring
The utilization of the compute resources of the VxRail cluster should be monitored on a regular basis to ensure that the utilization of CPU and memory stays below 50% (ideally at or below the 45% named in the requirements), so that all virtual workloads can run on the nodes in the remaining fault domain.

Review of Storage Policies
The assigned policies should be checked to see if powering off a site has an influence on the data availability for all VMs. It could be necessary to change some policies before and after the relocation.

Review of DRS rulesets and groups
The implemented DRS ruleset needs to be reviewed to make sure that all workloads can be migrated to the nodes residing in the remaining fault domain. If a rule would prevent the migration, it will be necessary to disable it during the relocation.

Definition of highly sensitive virtual machines
The virtual machines residing on the VxRail cluster need to be reviewed to define highly sensitive virtual machines that should not be migrated during the production hours.

Definition of non-productive virtual machines
The virtual machines residing on the VxRail cluster need to be reviewed to define non-productive virtual machines that can be shut down during the relocation (e.g. backup proxies, test and development systems). This helps to reduce the number of migrations and/or to free up actively used CPU and memory resources.

Spare parts
Simple spare parts like network and power cables or even optical transceivers should be available in case of a malfunction or defect.
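For the storage policy review listed above, it helps to sort the VMs by where their policy keeps the data. The sketch below works on a plain list and does not query vCenter. The three labels are modelled on the "Site disaster tolerance" choices of a vSAN storage policy, but the exact strings and the `VmPolicy` type are assumptions of mine.

```python
from dataclasses import dataclass

# Hypothetical labels, modelled on the site disaster tolerance options.
DUAL_SITE = "dual-site-mirroring"
KEEP_ON_PREFERRED = "keep-on-preferred"
KEEP_ON_NON_PREFERRED = "keep-on-non-preferred"


@dataclass
class VmPolicy:
    vm: str
    site_tolerance: str


def review_policies(vms, site_going_down):
    """Sort VM names into 'unaffected', 'affected' and 'review'.

    site_going_down is 'preferred' or 'non-preferred'. A VM is 'affected'
    when its data is kept only on the site that gets powered off; unknown
    policy labels end up in 'review' for a manual check.
    """
    if site_going_down == "preferred":
        pinned_here, pinned_elsewhere = KEEP_ON_PREFERRED, KEEP_ON_NON_PREFERRED
    elif site_going_down == "non-preferred":
        pinned_here, pinned_elsewhere = KEEP_ON_NON_PREFERRED, KEEP_ON_PREFERRED
    else:
        raise ValueError("site_going_down must be 'preferred' or 'non-preferred'")

    result = {"unaffected": [], "affected": [], "review": []}
    for item in vms:
        if item.site_tolerance == pinned_here:
            result["affected"].append(item.vm)
        elif item.site_tolerance in (DUAL_SITE, pinned_elsewhere):
            result["unaffected"].append(item.vm)
        else:
            result["review"].append(item.vm)
    return result
```

Everything in "affected" needs either a temporary policy change before the relocation or a planned shutdown; mirrored VMs stay available but run with reduced redundancy until the resync is done.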

// Day before Relocation

The following tasks should be done the day before the physical relocation of the nodes:

Health check
An extensive health check should be done before continuing with the additional tasks. A log should be kept during the relocation, to take notes and screenshots in the case of issues.

Backup
All virtual machines of the cluster should be covered by an active backup job that runs successfully before the relocation.

Disabling DRS Rules
All DRS rules preventing a migration need to be disabled.

Set DRS Automation level
To prevent DRS from migrating virtual machines back, the automation level should be set to partially automated.

Shutdown of virtual machines
All non-productive virtual machines should be shut down and can be migrated offline.

Migration of virtual machines
All other virtual machines need to be migrated online using vMotion. If there are highly sensitive virtual machines, those should be migrated outside of regular service times.
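The shutdown and migration steps above can be turned into a simple work list per VM. This is a sketch on plain data, assuming you have already flagged each VM as non-productive or highly sensitive during the preparations; the `Vm` type, the flags and the host names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SHUT_DOWN = "shut down (migrate offline if on a relocating node)"
    VMOTION_OFF_HOURS = "vMotion outside of regular service times"
    VMOTION_ONLINE = "vMotion during the day"


@dataclass
class Vm:
    name: str
    host: str
    non_productive: bool = False
    highly_sensitive: bool = False


def plan_evacuation(vms, hosts_to_relocate):
    """Group VM names by the action the day-before checklist asks for.

    Non-productive VMs are shut down wherever they run; all other VMs
    only show up in the plan if they run on a node that will be relocated.
    """
    hosts = set(hosts_to_relocate)
    plan = {action: [] for action in Action}
    for vm in vms:
        if vm.non_productive:
            plan[Action.SHUT_DOWN].append(vm.name)
        elif vm.host in hosts:
            if vm.highly_sensitive:
                plan[Action.VMOTION_OFF_HOURS].append(vm.name)
            else:
                plan[Action.VMOTION_ONLINE].append(vm.name)
    return plan
```

The resulting lists also make a handy attachment for the change request, because the service owners can see in advance what happens to their VMs.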

// Relocation day – before transport

On the relocation day the following tasks should be performed:

Final Health check
A final health check should be performed to validate the current state of the cluster.

Backup check
A check of all backup jobs should be performed.

Checking if new virtual machines are deployed on the wrong nodes
Verify that there are no new virtual machines on the nodes which will be relocated. If there are any new virtual machines, those need to be migrated.

Monitoring
All VxRail systems should be set to maintenance in the monitoring tool and the operational service teams should be informed accordingly.

Adjust vSAN object repair timer
The vSAN object repair timer should be adjusted to approximately 600 mins which equals 10 hours (the default value is 60 mins). The value can be adjusted according to the final time plan of the relocation.

Maintenance Mode on nodes which will be relocated
All VxRail Nodes which will be relocated need to be put into maintenance mode with the option “No data migration”.

Shut down of nodes which will be relocated
After the VxRail Nodes are in maintenance mode the nodes can be shut down in a controlled manner.

Dismounting of nodes which will be relocated
After the shutdown, the nodes can be de-cabled, dismounted, and packed safely for transport.
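Two of the checks in the list above are easy to script as a last gate before maintenance mode: finding VMs that still sit on a node to be relocated, and deriving the object repair timer from the time plan. The sketch below is plain Python on made-up input. The two-hour buffer is my assumption; with a planned outage of 8 hours it lands on the 600 minutes mentioned above.

```python
import math


def vms_still_on_hosts(vm_to_host, hosts_to_relocate):
    """Return the sorted names of VMs that still run on a node to be relocated.
    vm_to_host is a dict of VM name -> host name (hypothetical inventory export)."""
    hosts = set(hosts_to_relocate)
    return sorted(vm for vm, host in vm_to_host.items() if host in hosts)


def repair_timer_minutes(planned_outage_hours, buffer_hours=2.0, default_minutes=60):
    """Planned outage plus buffer, rounded up to full hours, in minutes.
    Never returns less than the default of 60 minutes."""
    total_hours = math.ceil(planned_outage_hours + buffer_hours)
    return max(default_minutes, total_hours * 60)


if __name__ == "__main__":
    inventory = {"web01": "node-a1", "new-vm": "node-b2", "db01": "node-a2"}
    print(vms_still_on_hosts(inventory, ["node-b1", "node-b2"]))
    print(repair_timer_minutes(8))
```

If the first function returns anything, those VMs have to be migrated before the nodes go into maintenance mode.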

// Relocation day – after transport

In the new location the following tasks need to be done to bring the cluster back to operation:

Rack and stack
The nodes must be racked and cabled according to the provided cable plan. It is mandatory that the network is already up and running before this task.

Initial access and health check over iDRAC
After racking and cabling, the nodes should be accessible over their iDRAC interfaces. A first check regarding the hardware health should be done immediately. If there are no errors, the nodes can be powered on.

Network connectivity checks
After the nodes are powered on the network connectivity checks should be performed (e.g. vmkpings). If there are any problems, they must be fixed before the next step. A vDS Health Check can be helpful to determine if there are VLANs missing on the physical switch ports.

Exiting maintenance mode
If all nodes have the expected network connectivity, the nodes can be taken out of maintenance mode.

vSAN monitoring and adjusting of the object repair timer
The vSAN status and the resyncing operations need to be monitored closely. The vSAN object repair timer should be reset to its default value of 60 mins if everything is healthy.

Adjust DRS ruleset and automation level
After the resync is complete the DRS ruleset can be activated again. To have an even utilization within the cluster the DRS automation level can be set back to fully automated again. In case there are highly sensitive virtual machines those should be migrated manually outside of regular service times.

Turn on powered off virtual machines
All non-productive virtual machines which have been powered off for the relocation can be powered on again.

Regular health checks and monitoring
After the relocation the system should be monitored closely. Additionally, health checks should be performed regularly (hourly).
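For the network connectivity checks in the list above, it saves time to prepare the vmkping commands in advance. The sketch below only builds the command strings to paste into an SSH session on a node. The VMkernel interface names, MTU values and IP addresses are placeholders, so take the real ones from your documentation. The payload size is the MTU minus 28 bytes (20 bytes IPv4 header plus 8 bytes ICMP header), and `-d` sets the "don't fragment" bit so that an MTU mismatch on the new switch ports shows up as a failed ping.

```python
ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def vmkping_commands(checks, count=3):
    """Build one vmkping command per (vmk_interface, mtu, target_ip) tuple.

    -I selects the VMkernel interface, -s the payload size,
    -d sets the don't-fragment bit, -c the number of packets.
    """
    commands = []
    for interface, mtu, target in checks:
        payload = mtu - ICMP_OVERHEAD
        commands.append(f"vmkping -I {interface} -s {payload} -d -c {count} {target}")
    return commands


if __name__ == "__main__":
    # Placeholder values, not taken from a real cluster.
    checks = [
        ("vmk3", 9000, "192.168.10.12"),  # e.g. vSAN traffic with jumbo frames
        ("vmk4", 1500, "192.168.20.12"),  # e.g. vMotion with standard MTU
    ]
    for command in vmkping_commands(checks):
        print(command)
```

Pinging at least one node in the remaining fault domain from every relocated node, for every traffic type, catches missing VLANs before the nodes leave maintenance mode.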

// Day(s) after Relocation

To ensure a fully operational system the following tasks should be performed the day after the relocation:

Backup
It must be checked whether all backup jobs are working or whether there are any issues.

Monitoring
The system should be monitored closely to be able to react fast in case of any issues.

Regular health checks
Health checks should be performed regularly.

Open issues
In case of open warnings, alarms and errors, or if unexpected problems have occurred that cannot be resolved by the persons involved in the relocation, the Dell customer hotline must be contacted.

If you have read up to this point, I hope my article was helpful to you. Feel free to share if you like…

// footnotes:

Date: 03.01.2023
Version: 1.0