VxRail update guide – make IT work

Last week a customer asked me to write a guide he could use to update his VxRail clusters on his own in the future. Because I think this guide could also be useful for others out there, I decided to share a more or less generalized version on my blog.

// important notes

This guide describes my recommendations and actions to be performed before and during a VxRail update to ensure that it runs properly and as error-free as possible.

It is intended for “minor” VxRail updates only, i.e. patches within the same major release. Upgrades to a new “major” release will very likely differ and should be prepared separately.

The instructions refer exclusively to the VxRail cluster and the systems required for it.

It is assumed that the reader has basic knowledge of virtualization and is familiar with the administration of his environment.

// preparations

Follow the update procedure carefully so that all steps are performed in the order dictated by their dependencies and no step is omitted or forgotten.

– A Jump Host with network access to all management systems should be ready for use.
– IP addresses and the corresponding administrative user accounts with passwords for access to all management systems should be available.
– Documentation for all involved systems (VxRail, network, etc.) should be available.
– A log should be kept, screenshots taken and log files collected – not only in the case of a failure.
– All VxRail systems should be set to maintenance in the monitoring tool and the operational service teams should be informed accordingly.

// DNS and NTP

DNS and NTP are essential services for any vSphere, vSAN/VxRail or VCF environment.

The names of all management systems should be resolvable via DNS. At least one of the DNS servers configured in the environment must remain available at all times, at least until the end of the cluster update.
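
A quick way to verify name resolution from the Jump Host is a small script like the following. It is only a minimal sketch, assuming Python 3 is available there; the helper name check_dns and the host names in the usage comment are placeholders I made up for illustration.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def check_dns(hostnames: Iterable[str],
              resolve: Callable[[str], str] = socket.gethostbyname
              ) -> Dict[str, Optional[str]]:
    """Return {hostname: IPv4 address}, with None where the lookup failed."""
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = resolve(name)
        except OSError:  # socket.gaierror is a subclass of OSError
            results[name] = None
    return results


# Usage with placeholder names (replace with your management systems):
# for host, ip in check_dns(["vxrm.example.local", "vcsa.example.local"]).items():
#     print(host, ip or "NOT RESOLVABLE")
```

The resolver is passed in as a parameter only so that the function can be tried out without a real DNS server; by default it uses the DNS servers configured on the machine running the script.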

All management systems should be synchronized to the same time. The NTP server configured in the environment should be accessible at all times.
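
To see whether the NTP server answers at all and roughly how far a machine's clock is off, a plain SNTP query is enough. The following is a rough sketch under my own assumptions: it uses only the Python standard library, ignores network delay, and the function names are made up for this example. It compares the NTP server with the clock of the machine the script runs on, nothing more.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp from an NTP reply as Unix time."""
    if len(packet) < 48:
        raise ValueError("NTP reply too short")
    # Bytes 40..47 hold the transmit timestamp: 32-bit seconds + 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server: str, timeout: float = 3.0) -> float:
    """Rough offset (server time minus local time) in seconds."""
    request = b"\x1b" + 47 * b"\0"  # LI=0, version 3, mode 3 (client)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    return parse_transmit_time(reply) - time.time()


# Usage with a placeholder server name:
# print(f"offset: {clock_offset('ntp.example.local'):+.3f} s")
```

A timeout here already tells you that the NTP server is not reachable from that machine, which is worth fixing before the update.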

// external vCenter Server

If you have an external vCenter Server that is not managed by the VxRail lifecycle management (LCM), you have to check which versions are supported and will most probably have to update it before the VxRail cluster.

Look especially for certificate/thumbprint, SSO, DNS or disk space issues.

A very useful tool for checking the health of your vCenter before an update is the VMware Fling called vSphere Diagnostic Tool (VDT), which you can get here:
https://flings.vmware.com/vsphere-diagnostic-tool

// vSAN Witness Appliance

If you have a vSAN Witness Appliance in a Stretched Cluster or a 2 Node Cluster, you can decide whether you want to update it manually first or let the VxRail lifecycle management (LCM) do the update for you. The Witness needs to be on the same ESXi build as the VxRail nodes.

// Solution compatibility & interoperability

All vCenter-integrated solutions (like NSX, vRealize, Horizon or Citrix, etc.) should be checked for version compatibility & interoperability with the new VxRail/vSphere version. For minor VxRail updates there should be nothing to do, but I would recommend checking it every time.

At the very least, update your backup server a few days before the VxRail update!

// vMotion compatibility

The goal of any update should be to update the environment in a way that there is no interruption to the cluster’s virtual workloads and provisioned services.

Since in most cases all VxRail nodes are updated, every VM in the cluster will automatically be migrated “live” to another node via vMotion at least once. There may be a few systems that should not or cannot be migrated live for whatever reason. It is up to the customer to know his systems and prepare them for the update accordingly.

// DRS Affinity Rules

If VxVerify reports a corresponding warning, it may be necessary to disable certain DRS rules before the update. If VMs are not allowed to leave a host (host affinity “must” rules for backup proxy VMs, for example), they must be shut down before the update. After the update, they can be started again or the DRS rule can be re-enabled.

// backups

Backups of all management systems should be up-to-date and complete before the update.

I would also recommend taking an offline snapshot of the vCenter and the VxRail Manager before the update in order to get back to the initial state very quickly in the case of a failure.

The open snapshots of the management systems should be deleted promptly after the successful update, as soon as the environment is running without errors or a rollback is no longer possible.

In the days after the update, it must be checked whether all backups are running cleanly and completely.

// basic health checks

The VxRail cluster should be “Healthy” before and after the update! I would not recommend performing the update as long as there are open warnings or alarms. These must be checked and, if necessary, fixed before the update!

Warnings and alarms occurring during the update can be caused by the update process itself, and most of them are resolved automatically. However, know-how and a little experience are required to assess how critical the warnings and alarms are and to decide whether immediate action is needed.

I always check the following:

– vCenter Skyline Health Checks
– vCenter All Issues & Triggered Alarms
– vCenter Appliance Management Interface (Health status, SSO status, Services states)
– vCenter Server CLI (service-control --status, partition fill levels, etc.)
– Cluster All Issues & Triggered Alarms
– vSAN Skyline Health Checks
– ESXi Host Client Health Checks for the vSAN Datastore
– vSAN Virtual Objects
– vSAN Resyncing Objects
– VxRail Appliances View of the VxRail Plugin (Cluster Health, Operational State, Appliances States)
– VxRail Manager CLI (systemctl status vmware-marvin, docker node & service status, partition fill levels, etc.)
– iDRAC (Dashboard, System and Storage views, System events)

Also don’t forget to check the health of vSAN iSCSI Target Services, vSAN File Services or HCI Mesh Remote Datastores if they are in use in your environment.
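
For the partition fill levels mentioned in the checklist, a few lines of Python can flag the partitions that deserve a closer look. This is a minimal sketch, not an official check: the 80 % threshold is my own assumption, the mount points in the usage comment are placeholders, and the function name is made up.

```python
import shutil
from typing import Callable, Dict, Iterable


def full_partitions(mount_points: Iterable[str],
                    threshold_percent: float = 80.0,
                    usage_fn: Callable = shutil.disk_usage) -> Dict[str, float]:
    """Return {mount point: used %} for partitions at or above the threshold."""
    flagged: Dict[str, float] = {}
    for path in mount_points:
        usage = usage_fn(path)  # object with .total, .used and .free in bytes
        if usage.total == 0:
            continue  # skip pseudo file systems that report no size
        used_percent = usage.used / usage.total * 100
        if used_percent >= threshold_percent:
            flagged[path] = round(used_percent, 1)
    return flagged


# Usage with placeholder mount points (adjust to the system you are checking):
# print(full_partitions(["/", "/var/log"]))
```

An empty result means that none of the listed partitions has reached the threshold. The output of df -h gives you the same information interactively.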

// Update procedure

// Release Notes

Reading the VxRail Release Notes is generally recommended so that you know which components (including vCenter, ESXi, BIOS, iDRAC, firmware, etc.) will be updated. Furthermore, you will find the Fixed and Known Issues for each version and can therefore better judge how important an update is or whether there is something special to consider.

The Release Notes can be found here:
https://support.emc.com/docu98130_VxRail-7.0.x-Release-Notes.pdf

// SolVe procedure

For most activities in a VxRail environment (installations, updates, extensions, replacements or even configuration adjustments), there are officially supported procedures.

Each procedure lists, among other things, the options you selected when creating it, important knowledge base articles that you should review beforehand, recommended materials that you will need in the process, and recommended activities that you should perform in advance, such as taking snapshots of the management systems and running the VxVerify script.

I would recommend creating a new, dedicated procedure for each update process at the beginning of the update planning!

The procedure can be created at SolVe Online:
https://solveonline.emc.com/solve/home/51

// VxVerify

VxVerify is a Python script that performs a comprehensive analysis of the state of a VxRail cluster. It is highly recommended to run VxVerify before upgrades, updates, extensions or for general maintenance. When VxVerify is run on VxRail Manager, it sends so-called “minions” (small Python programs) to each VxRail node in the cluster. These minions, in turn, perform checks on each node. In addition to ESXi host-specific tests, VxVerify also performs checks at the VxRail Manager, VMs, vCenter and cluster levels.

VxVerify can detect many potential known issues in advance, helping to identify potential showstoppers early enough to get them out of the way in time for the update. The script can be started manually at any time and multiple times. It is also included in the Update Pre-Check.

I would recommend running VxVerify for the first time one to two weeks before the planned update and repeating it as often as necessary until all reported warnings and alarms are resolved or can be ignored.

The latest script can be downloaded as a zip file from the following page:
https://www.dell.com/support/kbdoc/en-us/000021527/vxrail-how-to-run-vxverify

The unzipped script must then be copied to the VxRail Manager (e.g. via WinSCP). On the VxRail Manager it can be executed with the flag “-r” to check the vCenter as well:

# python vxverify_<version-no>.pyc -r root
(followed by entering the vCenter root password).

The result, which is also saved in a text file for later use, contains corresponding warnings (yellow) or alarms (red). Alarms must be fixed before the update! Warnings should be checked individually to determine whether they are really a “problem” for the update and also have to be fixed, or whether they are e.g. only a hint to check a component (like vCenter Server, Witness Appliance, NSX, etc.).

The Dell KB article numbers referenced in the result can be looked up here:
https://www.dell.com/support/contents/en-us/category/product-support/self-support-knowledgebase

// VxRail Bundle

The VxRail Cluster can be updated online (Internet Update) or by bundle (Local Update).

The dedicated “Upgrade Package” with the desired target version can be downloaded from the following page at Dell:
https://www.dell.com/support/home/en-us/product-support/product/vxrail-software/drivers

In general, it should be in everyone’s best interest to use stable and proven versions. I would not recommend installing the latest bundle in a production environment in the first days after its release. The updates and/or the bundles have been tested and validated many times by Dell, but there is always a small risk that a bug that has not yet been discovered still exists in one of the many components.

However, you should start planning the update from the day of publication. It is always necessary to check the dependencies of the vCenter-integrated solutions for compatibility (e.g. Backup). Under certain circumstances, it may be necessary to apply an update there first.

// Update Pre-Check

After downloading or uploading the bundle to the VxRail Manager, a Pre-Check can be performed before the update. This pre-check also contains the VxVerify script, but additionally checks further points in the vSphere/vSAN environment that cannot be checked by VxVerify.

I would recommend running the Pre-Check at least one day before the update and fixing or clarifying all reported warnings and alarms.

// Update Start

If there are no more open warnings or alarms in any of the health checks or in the Pre-Check (incl. VxVerify) and the backups of the management systems have been made, the VxRail update can be started.

For most updates you can expect a duration of about 0.5h per management system and 1h per node, so you should start early if you want to update a larger cluster. Since the latest 7.0.24x version it is possible to pause an update (for example after one Stretched Cluster site), but I have not tried this yet.

The update sequence is usually as follows: first the VxRail Manager, then the vCenter and then all VxRail nodes one after the other, starting with the smallest serial number.
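
The rule of thumb for the duration and the usual sequence can be combined into a rough schedule. The following sketch is only an illustration: the function name and the serial numbers in the usage comment are made up, and I assume that “smallest serial number” means plain string ordering.

```python
from typing import List, Tuple

HOURS_PER_MGMT_SYSTEM = 0.5  # rule of thumb from this guide
HOURS_PER_NODE = 1.0         # rule of thumb from this guide


def update_plan(node_serials: List[str],
                mgmt_systems: Tuple[str, ...] = ("VxRail Manager", "vCenter")
                ) -> List[Tuple[str, float]]:
    """Return (component, cumulative hours) in the usual update order."""
    plan: List[Tuple[str, float]] = []
    elapsed = 0.0
    for system in mgmt_systems:
        elapsed += HOURS_PER_MGMT_SYSTEM
        plan.append((system, elapsed))
    # Assumption: "smallest serial number first" is plain string ordering.
    for serial in sorted(node_serials):
        elapsed += HOURS_PER_NODE
        plan.append((serial, elapsed))
    return plan


# Usage with made-up serial numbers:
# for component, hours in update_plan(["SN0003", "SN0001", "SN0002"]):
#     print(f"{component}: done after about {hours:.1f} h")
```

With an external vCenter that is updated separately, pass mgmt_systems=("VxRail Manager",) instead. The numbers are estimates only; firmware-heavy updates can take longer.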

Sometimes it is not recognizable in the vSphere Client or in the VxRail Update Menu whether the process is (still) doing anything at all, or what it is doing. Please stay calm and have a little patience.

A look into the update log of the VxRail Manager via SSH can provide clarity:
# tail -f /var/log/mystic/lcm-web.log
# tail -f /var/log/mystic/web.log
# tail -f /var/log/mystic/lcm-do.log

After the VxRail Manager update (or even after a vCenter restart), the VxRail plugin in the vSphere Client (and thus the VxRail Update menu) may be temporarily unavailable. It may take a moment for the plugin to update (or load) and may require a browser refresh.

During the vCenter update and restart, the connection to the vSphere Client is temporarily lost. If possible, note beforehand which node the vCenter VM is running on and connect to that node with the ESXi Host Client. During the vCenter update you can monitor the update and boot process in the VM console. Once the vCenter has started, you can check in the VAMI whether all vCenter services are starting and log back into the vSphere Client as soon as its service is up.

To be able to monitor the update and boot processes of the VxRail nodes, you can connect to the iDRACs. During the iDRAC update, the connection to the iDRAC is temporarily lost. When the iDRAC has been restarted (it can be monitored e.g. with the help of a continuous ping), you can log in again.
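
Instead of watching a continuous ping, you can also let a small script wait until the iDRAC answers again. Note that this sketch swaps the ICMP ping for a TCP connection test: it assumes the iDRAC web interface listens on its default HTTPS port 443, and the function names and the host name in the usage comment are placeholders.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_until_up(host: str, attempts: int = 120, interval: float = 5.0,
                  probe: Callable[[str], bool] = port_open,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Probe the host repeatedly; return True as soon as it answers."""
    for attempt in range(attempts):
        if probe(host):
            return True
        if attempt < attempts - 1:
            sleep(interval)  # wait before the next try
    return False


# Usage with a placeholder name (about 10 minutes with the defaults):
# print("iDRAC is back" if wait_until_up("idrac-node01.example.local") else "still down")
```

An open port only tells you that the web interface is reachable again; it may still take a moment until the login works.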

In certain cases, for example if you have not updated the vSAN Witness Appliance first and have not chosen the VxRail option to do it for you automatically, you now have to update it manually. After that, another manual task will probably be waiting, namely the vSAN Disk Format Update.

Finally, if you use the Dell Secure Remote Services (o.s.) for Call Home and/or VMware vRealize Log Insight for syslog you should not forget to manually update them now.

// post-checks

If the update has run successfully and all subsequent health checks are good, the customer’s systems and their applications or services running on the VxRail cluster and all physical servers accessing a service of the VxRail cluster (such as vSAN iSCSI Target Service, File Services or Remote Datastores) should also be gradually checked for functionality.

Increased attention and careful checking of the monitoring and backup systems are essential after the update. If unexpected problems occur that cannot be solved by yourself, various support options are available.

If you have read up to this point, I hope my article was helpful to you. Feel free to share if you like…

// footnotes:

Date: 31.10.2022
Version: 1.2
