vSAN cluster shutdown & startup

Since vSphere/vSAN 7 Update 3 has been withdrawn and is about to be re-released shortly (but hopefully not installed in production very soon!), I had to use the “traditional” way to manually shut down and start up two vSAN clusters of a customer for a power maintenance, following the VMware procedure:

Manually Shut Down and Restart the vSAN Cluster
https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.vsan-monitoring.doc/GUID-31B4F958-30A9-4BEC-819E-32A18A685688.html

The VMware procedure is well documented but not 100% accurate. That’s why I want to share my experience. Maybe it can be helpful for someone out there.

// important note

This article was written for VMware vSAN 7 Update 2 only, and the procedure has been successfully tested several times. With vSAN 7 Update 3, VMware introduces a new wizard that automates the cluster shutdown. The startup procedure will probably stay as described here.

By the way, this procedure can also be helpful for VxRail clusters if the automated workflow is not working correctly.

// preparation
Stick closely to the shutdown & startup plans: every step must be performed in the order dictated by its dependencies, and no step may be omitted or forgotten.

/ DNS-Server
At least one of the DNS servers configured in the environment must be available at all times, or at least until the end of the cluster shutdown. If the DNS server VM resides on the same cluster, it must be shut down last and started first.
All management systems and vSAN nodes should be resolvable via DNS (forward and reverse).
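
The forward and reverse requirement can be spot-checked from the jump host before the maintenance window. The following is only a minimal sketch in Python, assuming the resolver of the jump host is the one the environment uses. The function name check_dns and the host name in the usage note are made up for illustration.

```python
import socket


def check_dns(hostname, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward plus reverse lookup of hostname.

    forward and reverse are injectable so the logic can be tried offline.
    """
    try:
        ip = forward(hostname)
        reverse_name = reverse(ip)[0]
    except OSError as err:
        # socket.gaierror and socket.herror are both subclasses of OSError.
        return False, f"{hostname}: lookup failed ({err})"
    # Compare case-insensitively and ignore a trailing dot in the PTR answer.
    if reverse_name.lower().rstrip(".") != hostname.lower().rstrip("."):
        return False, f"{hostname}: {ip} resolves back to {reverse_name}"
    return True, f"{hostname}: {ip} (forward and reverse match)"
```

Calling check_dns("esx01.example.local") for every management system and vSAN node returns a flag plus a short message that can go straight into the maintenance log.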

/ NTP-Server
The NTP server configured in the environment should also be accessible at all times. All management systems and nodes of the cluster should run with the same time.
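
To see whether all nodes really run with the same time, the clocks can be compared once before the shutdown. This is a rough sketch only: it assumes the UTC time of each node was collected as a string, for example by running date -u +%Y-%m-%dT%H:%M:%SZ on each node via SSH, and the function name max_clock_skew is hypothetical. Collecting the values one after another adds a few seconds of delay, so small differences mean nothing.

```python
from datetime import datetime


def max_clock_skew(samples):
    """samples maps node name -> UTC timestamp string like '2022-01-27T10:00:00Z'.

    Returns (skew_in_seconds, node_with_earliest_clock, node_with_latest_clock).
    """
    parsed = {
        node: datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        for node, stamp in samples.items()
    }
    earliest = min(parsed, key=parsed.get)
    latest = max(parsed, key=parsed.get)
    skew = (parsed[latest] - parsed[earliest]).total_seconds()
    return skew, earliest, latest
```

A skew of minutes rather than seconds is a good reason to look at the NTP configuration before touching the cluster.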

/ other things to consider
– A Jump Host with network access to all systems involved should be operational.
– IP addresses and the corresponding administrative user accounts with passwords for access to all participating systems should be available. Ideally the complete documentation for all systems involved should be available.
– Backups of all involved management and customer systems should be current and complete. The backup server should be up and running in case of an emergency restore.
– Monitoring systems should be put in maintenance accordingly.
– All people involved or affected should be put on notice as early as possible.
– A log should be kept during the procedures.

// basic health checks
The vSAN cluster should be “Healthy” before the shutdown and of course also after the startup. If there should be open warnings or alarms they must be checked first and fixed if necessary.

/ vCenter Skyline Health
The vCenter Skyline Health Check should be checked for new or open warnings and alarms.

/ vCenter Issues & Alarms
Check for new or open warnings and alarms on vCenter level. These can be found under “All Issues” and/or “Triggered Alarms”.

/ Cluster Issues & Alarms
Additionally, check for new or open warnings and alarms on cluster level. These can be found under “All Issues” and/or “Triggered Alarms”.

/ vSAN Skyline Health
The vSAN Skyline Health Check should be checked for new or open warnings and alarms.

/ vSAN Virtual Objects
Under Virtual Objects all objects should be “Healthy”.

/ vSAN Resyncing Objects
Nothing should be pending under Resyncing Objects.

/ special services
If you use special services like vSAN iSCSI target services, SMB/NFS file services or HCI Mesh you should additionally check everything accordingly.

// Cluster shutdown

1 – Customer systems
All customer VMs running on the cluster and all physical servers accessing a service of the cluster (such as vSAN iSCSI Target Service, file services or HCI Mesh) must be put in maintenance or shut down first!

2 – Management VMs
All VMware management VMs can be shut down on the vSAN cluster except for the following systems, which must remain powered on until last:

3 – DRS
The vCenter VM can be migrated to a dedicated vSAN node for simplicity. To ensure that it remains there and can be found there after startup, the DRS migration threshold should be set to a low value (e.g., level 1).

4 – ESXi Lockdown Mode
The ESXi Lockdown Mode should be deactivated on all vSAN nodes before shutdown to have more access possibilities in case of failure.

5 – ESXi SSH Service
The SSH service has to be started on all vSAN nodes before shutdown to be able to connect to all vSAN nodes via SSH afterwards.

6 – vSphere HA
vSphere HA must be temporarily disabled to avoid provoking unnecessary HA failover processes.

7 – vCLS Retreat Mode
The vCLS Retreat Mode must be temporarily enabled, i.e. the vCenter advanced setting config.vcls.clusters.domain-c<ID>.enabled is set to false. All vCLS VMs are then removed cleanly and automatically.

8 – Health Check
Now a short final health check should be performed.

9 – vCenter Server
If possible, the vCenter VM should be shut down via the Actions menu of the VAMI.

10 – DNS server VM
If the DNS server VM resides on the cluster, it should be shut down last via the ESXi Host Client!

11 – Cluster Member Updates
The following command must be issued on all vSAN nodes of the cluster via SSH:
# esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates
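
With more than a handful of nodes it is easy to miss one when a command has to be issued on all of them. A minimal sketch of a helper for the jump host, assuming a local ssh client, root logins on the nodes and the SSH service already started as described in step 5. The function name run_on_all_nodes and the node names in the usage note are placeholders.

```python
import subprocess


def run_on_all_nodes(nodes, command, runner=subprocess.run):
    """Run one shell command on every node via ssh, one node after another.

    Returns a dict node -> True if the command exited with status 0.
    runner is injectable so the helper can be dry-run without any SSH access.
    """
    results = {}
    for node in nodes:
        completed = runner(
            ["ssh", f"root@{node}", command],
            capture_output=True,
            text=True,
        )
        results[node] = completed.returncode == 0
    return results
```

For example, run_on_all_nodes(["esx01", "esx02", "esx03"], "esxcfg-advcfg -s 1 /VSAN/IgnoreClusterMemberListUpdates") returns one entry per node, and any False value should be looked at before continuing. The same helper fits every other step that must be issued on all vSAN nodes.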

12 – Reboot Helper Script
The following script must be run on one vSAN node in the cluster via SSH:
# python /usr/lib/vmware/vsan/bin/reboot_helper.py prepare

13 – Maintenance Mode
The following command must be issued on all vSAN nodes of the cluster via SSH:
# esxcli system maintenanceMode set -e true -m noAction

14 – vSAN Nodes
Now all vSAN nodes can be shut down via the ESXi Host Client:

15 – OOBM
The OOBM does not have to be switched off. It continues to run during network work or goes off during power work and comes on when power is restored.

Now finally all vSAN nodes can be powered off.

// Cluster startup

1 – OOBM
The OOBM (out-of-band management) is always on or comes on as soon as power is restored.
At first glance, everything should be “green” in the dashboard:

2 – vSAN Nodes
All vSAN nodes should be powered on.
Under System, all components should be “green”.
Under Storage, the SSDs and disks should show up.
No alarms should show up in the System Events.

3 – ESXi SSH Service
After startup, the SSH service must be activated on all vSAN nodes via the ESXi Host Client, so that you can connect to all vSAN nodes via SSH.

4 – Maintenance Mode
The following command must be issued on all vSAN nodes of the cluster via SSH:
# esxcli system maintenanceMode set -e false

5 – Reboot Helper Script
The following script must be run on one vSAN node in the cluster via SSH:
# python /usr/lib/vmware/vsan/bin/reboot_helper.py recover

6 – Cluster Status
The following command can be used to check the cluster status on all vSAN nodes in the cluster.
# esxcli vsan cluster get
One node in the cluster should be master, a second should be backup and all others should be agents. The cluster settings should be identical for all vSAN nodes (except for master and backup IDs etc.).
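
Comparing the output of several nodes by eye gets tedious, so the role check can be scripted. This is only a sketch: it assumes the command prints indented "Key: value" lines including "Local Node State" and "Sub-Cluster UUID", which is what I have seen on vSAN 7. The function names are my own and not part of any VMware tooling.

```python
from collections import Counter


def parse_cluster_get(text):
    """Turn the 'Key: value' lines of `esxcli vsan cluster get` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:  # skip headings and lines without a value
            info[key] = value.strip()
    return info


def roles_look_sane(outputs):
    """outputs maps node name -> raw command output of that node.

    True if all nodes report the same Sub-Cluster UUID and there is exactly
    one MASTER, one BACKUP and only AGENT nodes besides them.
    """
    infos = [parse_cluster_get(text) for text in outputs.values()]
    roles = Counter(info.get("Local Node State", "UNKNOWN") for info in infos)
    same_cluster = len({info.get("Sub-Cluster UUID") for info in infos}) == 1
    return (
        same_cluster
        and roles["MASTER"] == 1
        and roles["BACKUP"] == 1
        and roles["AGENT"] == len(infos) - 2
    )
```

If roles_look_sane returns False, the raw output of each node should be compared manually before going on with the next step.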

7 – Cluster Health
All health checks for the vSAN datastore should be “green” in the ESXi Host Client:

8 – Cluster Member Updates
The following command must be issued on all vSAN nodes of the cluster via SSH:
# esxcfg-advcfg -s 0 /VSAN/IgnoreClusterMemberListUpdates
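
Since a forgotten node keeps ignoring member list updates, it is worth reading the value back on every node. A small sketch, assuming that esxcfg-advcfg -g /VSAN/IgnoreClusterMemberListUpdates answers with a line of the form "Value of IgnoreClusterMemberListUpdates is 0". Please verify that format on your own hosts first. The function name advcfg_value is hypothetical.

```python
import re


def advcfg_value(output):
    """Extract (option_name, value) from the text printed by `esxcfg-advcfg -g`.

    Raises ValueError if the text does not look like 'Value of <name> is <value>'.
    """
    match = re.search(r"Value of (\S+) is (\S+)", output)
    if match is None:
        raise ValueError(f"unexpected esxcfg-advcfg output: {output!r}")
    return match.group(1), match.group(2)
```

After the startup every node should report the value 0 again, and before the shutdown every node should have reported 1.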

9 – DNS server VM
If the DNS server VM resides on the cluster, it should be started first via the ESXi Host Client!

10 – vCenter Server
Now the vCenter Server VM can be booted via the Host Client:

11 – vCenter Server Check
Until the vSphere Client service has started, the vCenter can be checked for functionality via the VAMI (log in as the root user).
The Health Status should be “Good” and the Single Sign-On Status should be “Running”.
All services with the Startup Type “Automatic” should be “Started” and “Healthy”, especially the vSphere Client, so that you can log in there afterwards.

12 – vCLS Retreat Mode
The vCLS Retreat Mode must be switched off again, i.e. the vCenter advanced setting config.vcls.clusters.domain-c<ID>.enabled is set back to true. All vCLS VMs are then deployed automatically.

13 – vSphere HA
vSphere HA must be enabled again.

14 – Health Check
At this point at the latest, a detailed health check should be performed.

15 – Management VMs
If all health checks are in order, the remaining management VMs from VMware can be booted on the vSAN cluster. The following should already be available:

16 – ESXi SSH Service
SSH service should be stopped again on each vSAN node.

17 – ESXi Lockdown Mode
ESXi Lockdown Mode should be re-enabled on each vSAN node and the corresponding users should be authorized.

18 – DRS
DRS can be set back to its previously defined value (e.g. level 3).

19 – Customer systems
Before the customer systems can be started, the network and all vSAN nodes must run cleanly. If all health checks are OK, all customer VMs running on the vSAN cluster and all physical servers accessing a service of the vSAN cluster (such as vSAN iSCSI Target Service, file services or HCI Mesh) can be started one by one. It is important to know the application and system dependencies and to follow the correct startup order. This task is in the hands of the customer, as is the check of all customer systems and their applications or services.

// post-treatment
After the startup, the monitoring should be watched more closely and checked more frequently than usual. And of course the backup jobs should also be reviewed.

If you have read up to this point, I hope my article was helpful to you. Feel free to share if you like…

// footnotes

Date: 27.01.2022
Version: 1.0