VxRail Upgrade Guide

After writing the VxRail Update Guide, I wanted to cover major version upgrades as well, since some companies are still running VxRail version 4.7.xxx (which has by now reached End of General Support). Upgrades between major versions are usually a little more complex than updates within the same major version, which is why this took a little longer than expected. I hope you can still use it – if not for upgrades, then for updates, where you can skip some tasks.

Also check the addition to my previously released Update Guide and this Upgrade Guide with new topics regarding the upgrade to VxRail Version 8.0.xxx: VxRail Upgrade to Version 8.0.xxx

// Important Notes

This guide describes the recommendations and actions to perform before, during and after a VxRail upgrade so that the upgrade runs as smoothly and error-free as possible and the production environment keeps running without interruption.

It is intended for VxRail updates and upgrades, but comes with no guarantee of completeness or correctness. Future upgrades to new “major” releases will very likely differ and should be prepared for accordingly every time.

It is assumed that the reader has basic administration know-how in networking, Windows, Linux, vSphere and vSAN or VxRail, and it is highly advised that readers are familiar with the administration of their own environment.

// Recommendations

It is recommended to run a VxRail upgrade only when the environment is healthy!

Regular checking for consistent and up-to-date full backups of the VMs is a mandatory prerequisite to be able to compensate for any data loss.

// Out Of Scope

The guide exclusively refers to VxRail clusters and their necessary systems. To keep this guide as compact and clear as possible, detailed descriptions of the system structure are not included.

Troubleshooting measures are not discussed for reasons of space and complexity! If troubleshooting is needed, specialists should be consulted or the respective manufacturer's support should be contacted!

Further recommendations for the productive, virtual, and physical systems of the reader, their guest operating systems or their applications are not part of the document. These measures are the responsibility of the reader.

General Preparations

Conscientious adherence to the upgrade procedure is necessary to perform all steps in the correct order dictated by dependencies and not to omit or forget any step.

– A jump host (or similar) with direct network connection to the management systems of the VxRail cluster and corresponding applications (browser, SSH client, etc.) to access the systems must be available.
– IP addresses and the corresponding administrative user accounts with passwords for access to all management systems should be available.
– Documentation for all involved systems (VxRail, network, etc.) should be available.
– A log should be kept, screenshots taken and log files collected – not only in the case of a failure.
– All VxRail systems should be set to maintenance in the monitoring tool and the operational service teams should be informed accordingly.

// Release Notes

Reviewing the VxRail Release Notes is generally recommended to know which of the components (including vCenter, ESXi, BIOS, iDRAC, firmware, etc.) will get upgraded. Furthermore, you will find Fixed and Known Issues for each version and can therefore better judge how important an upgrade is or if there is something special to consider.

Release Notes can be found here:
https://support.emc.com/docu98130_VxRail-7.0.x-Release-Notes.pdf

In general, it is in your interest to use versions that are as stable and proven as possible. Therefore, I do not recommend installing bundles in the production environment directly after their release. Even if the upgrades or the bundles have been tested and validated many times, an as yet undiscovered error may have crept into one of the many components.

Upgrade planning, however, can begin on the day of publication. It is always necessary to check the dependencies of the vCenter-integrated solutions for compatibility (e.g., backup). Under certain circumstances, it may be necessary to upgrade those solutions first.

// (3rd party) Solution Compatibility & Interoperability

All vSphere- and especially vCenter-integrated solutions (like Dell VxRail, VMware NSX, Tanzu, vRealize Automation or Horizon, Citrix XenApp, XenDesktop or Netscaler, Vendor Plugins, etc.) should be checked for version compatibility and interoperability with the new vCenter/vSphere version. For minor VxRail upgrades there should be nothing to do but I’d recommend checking it every time.

At the very least, your backup software (CommVault, for example) should be upgraded a few weeks, or at least a few days, before the VxRail upgrade so that you can create a complete backup with the new version!

// Backup & Snapshots

Backups of all management systems should be up-to-date and complete before the upgrade. I’d recommend using both image-based and file-based backups of the vCenter Server and the VxRail Manager.

In the days after the upgrade, it must be checked whether all backups are running cleanly and completely.

I’d also recommend taking an offline snapshot of the vCenter and the VxRail Manager before the upgrade to get back to the initial state very quickly in the case of a failure.

The open snapshots of the management systems should be deleted promptly after the successful upgrade as soon as the environment is running without errors or if there is no way back.

Backups of components like the vDS and Resource Pools are also recommended.

// DNS- & NTP-Servers

DNS and NTP are essential services for any vSphere, vSAN/VxRail or VCF environment.

All management systems should be resolvable via DNS (forward and reverse). At least one of the DNS servers configured in the environment must be available at all times, or at least until the end of the cluster upgrade.

All management systems should run with the same time (UTC is recommended). The NTP server configured in the environment should always be reachable.
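
As a rough illustration of the forward and reverse lookup check, here is a minimal Python sketch that could be run from the jump host. The function name `check_dns` and the host names in the usage note are my own placeholders, not part of any VxRail tooling, and the resolvers are passed in as parameters only so that the logic can be tried out without a live DNS server.

```python
import socket


def check_dns(hostnames, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Return a list of problems found for the given fully qualified host names.

    forward and reverse default to the system resolver; they are parameters
    only so that the check can be exercised offline with fake resolvers.
    """
    problems = []
    for name in hostnames:
        try:
            ip = forward(name)
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            problems.append(f"{name}: forward lookup failed ({exc})")
            continue
        try:
            ptr_name = reverse(ip)[0]  # (hostname, aliases, addresses)
        except OSError as exc:  # socket.herror is a subclass of OSError
            problems.append(f"{name}: reverse lookup of {ip} failed ({exc})")
            continue
        # Compare case-insensitively and ignore a trailing dot.
        if ptr_name.rstrip(".").lower() != name.rstrip(".").lower():
            problems.append(f"{name}: {ip} resolves back to {ptr_name}")
    return problems
```

Called as `check_dns(["vcsa.example.local", "vxrm.example.local"])` (hypothetical names), an empty list means that every name resolved forward and back to itself. It says nothing about NTP, which has to be checked separately.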

// External vCenter Server

Some use cases (like Stretched Cluster or 2 Node Cluster) require or recommend having an external vCenter which is not managed by the LifeCycle of VxRail. In this case you must check the supported version on your own and you must upgrade the vCenter before the VxRail cluster.

A very useful tool to check the health of your vCenter before doing an upgrade is the VMware Fling called vSphere Diagnostic Tool (vdt) which you can get here:
https://flings.vmware.com/vsphere-diagnostic-tool

Another nice tool by Dell is the debug_vxrm-vc-script which you can find here:
https://www.dell.com/support/kbdoc/en-us/000020539

Look especially for certificate/thumbprint, SSO/vmdir, DNS/NTP or disk space issues.

Some other common stumbling blocks for major vCenter “side-by-side”-upgrades can be:
– having managed ESXi hosts which are running older, unsupported versions (like ESXi 6.0 in a vCenter 7.0)
– having a vSphere Distributed Switch which is not converted from Basic to Enhanced LACP Support
– not having ephemeral port groups for the management VLAN on the vSphere Distributed Switch
– needing a share (aka export directory) for the vCenter Database (especially for configuration and historical data) during the migration if the space on the vCenter partition is not sufficient

Don’t forget to check whether the (3rd party) vCenter-integrated solutions (see “(3rd party) Solution Compatibility & Interoperability” above) still work as they should after the vCenter upgrade.

// vSAN Witness Appliance

If you have a vSAN Witness Appliance in a Stretched Cluster or a 2 Node Cluster, you can decide if you want to upgrade it manually first or use the LifeCycle of VxRail to do the upgrade for you – depending on the Solve procedure and sometimes the version. At the end the most important thing to note is that the vSAN Witness Appliance must be on the same ESXi build as the VxRail nodes.
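
The build requirement above is easy to check by hand, but for larger clusters a small comparison like the following Python sketch may help. It assumes you have already collected the ESXi build numbers yourself (for example from the vSphere Client); the function name, host names and build numbers are purely illustrative.

```python
def hosts_with_different_build(witness_build, node_builds):
    """Return the names of nodes whose ESXi build differs from the witness.

    node_builds maps a host name to its ESXi build number. Values are
    compared as stripped strings so that 12345 and "12345 " count as equal.
    """
    wanted = str(witness_build).strip()
    return sorted(
        name for name, build in node_builds.items()
        if str(build).strip() != wanted
    )
```

An empty result means the Witness Appliance and all listed nodes report the same build.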

If you upgrade the Witness Appliance manually, remember that you must also perform the vSAN Disk Format upgrade manually afterwards.

Another option can be to deploy a new up-to-date Witness Appliance and just replace the old one in the vCenter under Fault Domains.

Some common stumbling blocks for Witness upgrades can be:
– embedded ESXi-License issue (is resolved in latest versions)
– SCSI controller of the vSAN Witness Appliance (a Paravirtual SCSI controller is required)

After the upgrade from 4.7 to 7.x you will also have to change the vSAN Object Format manually to use some of the new availability features. That process can take a long time depending on the amount of data in the cluster and you also need enough free space in the cluster to be able to change the format of all objects in parallel.

// vMotion Compatibility

The goal of any upgrade should be to upgrade the environment without interruption to the cluster’s production workloads, applications, and provisioned services.

Since all VxRail nodes are normally upgraded, all VMs in the cluster will be migrated “live” to another node automatically at least once using vMotion. There will probably be a few systems that should not or cannot be migrated live for whatever reason (latency-sensitive applications or databases, VMs limited by DRS (anti-)affinity rules, clustered systems, VoIP appliances, etc.). It is up to the reader to know all of their systems and to prepare them to be migrated online, or offline if there really is no other possibility.

// DRS Affinity Rules

It may be necessary to disable certain DRS rules before the upgrade. If VMs are not allowed to leave a host (host affinity “must” rules for backup proxy VMs, for example), they must be shut down before the upgrade. After the upgrade, they can be started up again or the DRS rule can be enabled. If there are other “must”-rules they should be disabled or changed to “should” if possible.

// ESXi Lockdown Mode

The ESXi Lockdown Mode can be deactivated manually before the upgrade to have more possibilities to access management systems in case of failure. It should be reactivated directly after the upgrade and the corresponding users must be authorized.

// Other Important Topics

It is very important to create and read the dedicated SolVe procedure before every upgrade. The procedure shows the special considerations or new requirements that can come with new versions.

One example from last year is the change in the Call Home functionality starting with VxRail 7.0.350 (via Secure Connect Gateway or VxRail Manager directly). If you are still using Dell Secure Remote Services (SRS), you must upgrade it to Secure Connect Gateway (SCG) before the VxRail upgrade. This can also require a firewall change: port 8443 is now required for SCG and/or VxRail Manager.
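
To verify such a firewall change before the upgrade, a simple TCP connection test is usually enough. The following Python sketch is one way to do it; the function name and the host name in the usage note are assumptions of mine, and the test must of course be run from the system that actually needs the access.

```python
import socket


def tcp_port_open(host, port, timeout=3.0, connect=socket.create_connection):
    """Return True if a TCP connection to host:port can be established.

    connect defaults to socket.create_connection; it is a parameter only
    so that the function can be tried out without a real network.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:  # refused, timed out, unreachable, name not resolved
        return False
    conn.close()
    return True
```

For example, `tcp_port_open("scg.example.local", 8443)` (hypothetical host name) returns `False` if the port is blocked or nothing is listening. A successful connection only proves reachability, not that the service behind the port works.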

Another example is the requirement for new IP address ranges, subnets, etc. for the Rancher Kubernetes Engine (RKE), which is used inside the VxRail Manager starting with VxRail 7.0.370. If you are already using one or more of these IPs or subnets (and VLANs) in your environment, you must change some config files on the VxRail Manager.
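
Whether your own networks collide with such reserved ranges can be checked with Python's standard `ipaddress` module. This is only a sketch: the function name is mine, and the ranges in the usage note are documentation placeholders. The real ranges must be taken from the SolVe procedure or the corresponding Dell KB article.

```python
import ipaddress


def find_overlaps(reserved_ranges, site_networks):
    """Return (reserved, site) pairs of CIDR strings that overlap."""
    conflicts = []
    for reserved in reserved_ranges:
        r_net = ipaddress.ip_network(reserved, strict=False)
        for site in site_networks:
            s_net = ipaddress.ip_network(site, strict=False)
            # Only networks of the same IP version can overlap.
            if r_net.version == s_net.version and r_net.overlaps(s_net):
                conflicts.append((reserved, site))
    return conflicts
```

With the placeholder values `find_overlaps(["192.0.2.0/24"], ["192.0.2.128/25", "198.51.100.0/24"])`, only the first site network is reported as a conflict.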

A new software version may also mean changes in password requirements. I’d recommend changing them accordingly before the upgrade. Please check the password rules from VMware and Dell!

A new major version generally means new VMware licenses are needed. You will need to upgrade them in your VMware portal if the licenses are still in a current support and subscription contract. After upgrading them you have to add them to your vCenter to apply them accordingly.

Health Checks

The VxRail cluster should be “Healthy” before and after the upgrade! Performing the upgrade while there are open warnings or alarms is not recommended. These must be checked and, if necessary, fixed before the upgrade!

Warnings and alarms occurring during the upgrade can be caused by the upgrade process itself and most of them get resolved automatically. However, know-how and some experience are required to be able to assess the warnings and alarms for criticality and to be able to derive direct need for action.

If the vCenter Skyline Health Check is active in the vSphere Client, it should be checked for new or open warnings and alarms!

vCenter Issues and Alarms should be checked for new or open warnings and alarms on vCenter level! These can be found in the vSphere Client under “All Issues” or/and “Triggered Alarms”.

The vCenter can alternatively or additionally be checked for functionality via the VAMI (login as root user) if the vSphere Client is not available for a short time (e.g., during upgrades, reboots, offline snapshots, etc.). The Health Status should be “Good“, and the Single Sign-On Status should be set to “Running”. All services that are set to Automatic as Startup Type should be “Started” and “Healthy”, especially the vSphere Client, to be able to log in there afterwards.

The vCenter Server appliance can be checked for functionality via CLI/SSH (login as root user). Above all, the fill level of the partitions should be checked:
# df -h
The vCenter services should also be checked with the following commands:
# service-control --list
# service-control --status
https://kb.vmware.com/s/article/2109887
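
If you save the `df -h` output in your upgrade log anyway, a few lines of Python can flag the partitions that are getting full. This is a minimal sketch: the function name is mine, and the 80 % threshold is my own assumption, not an official limit. Running `df -hP` instead keeps every file system on a single line, which makes the parsing more reliable.

```python
def full_partitions(df_output, threshold=80):
    """Return (mount_point, use_percent) for partitions at or above threshold.

    df_output is the text printed by "df -h" (or "df -hP").
    """
    hits = []
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Expected columns: Filesystem Size Used Avail Use% Mounted-on
        if len(fields) < 6 or not fields[4].endswith("%"):
            continue  # wrapped or unexpected line
        percent = fields[4][:-1]
        if not percent.isdigit():
            continue  # e.g. "-" for pseudo file systems
        if int(percent) >= threshold:
            hits.append((" ".join(fields[5:]), int(percent)))
    return hits
```

The same function can be used for the `df -h` output of the VxRail Manager.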

vSAN Cluster Issues & Alarms should be checked for new or open warnings and alarms on cluster level! These can be found in the vSphere Client under “All Issues” or/and “Triggered Alarms”.

The vSAN Skyline Health Checks in the vSphere Client should be checked for new or open warnings and alarms!
Alternatively, or in addition to the vSAN Skyline Health Checks, if the vSphere Client is not available for a short time, it can be checked in the ESXi Host Client if all Health Checks are “green” for the VxRail vSAN datastore.
Under Virtual Objects in the vSphere Client all objects should be “Healthy”!
No resync task should be running or pending under Resyncing Objects in the vSphere Client!

Under VxRail Appliances in the vSphere Client, the cluster, and its Operational State, as well as all individual nodes should be “Healthy” or have a green check mark!

The VxRail Manager can be checked for functionality via CLI/SSH (login as mystic user, then change to root user).
The fill level of the partitions should be checked:
# df -h
The VxRail Manager services should be checked:
# service --status-all
The microservices can be checked for functionality with the following commands (pre-7.0.370):
# docker node ls
# docker service ls
The microservices can be checked for functionality with the following commands (post-7.0.370):
# kubectl get services
# kubectl get nodes -o wide
# kubectl get deploy -o wide
# kubectl get pods
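
The `kubectl get pods` output can get long, so here is a small Python sketch that filters it for pods that need a closer look. It assumes the default column layout (NAME, READY, STATUS, ...) without the `-A` option; the function name and the pod names in any example are made up.

```python
def unhealthy_pods(kubectl_output):
    """Return (name, ready, status) for pods that are not fully up.

    A pod is skipped if its status is "Completed" (finished job) or if it
    is "Running" with all containers ready (e.g. READY 1/1).
    """
    problems = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, ready, status = fields[:3]
        if status == "Completed":
            continue
        ready_now, _, ready_wanted = ready.partition("/")
        if status != "Running" or ready_now != ready_wanted:
            problems.append((name, ready, status))
    return problems
```

An empty list is a good sign, but it does not replace the health checks in the vSphere Client.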

The iDRAC can be used alternatively or additionally to check for functionality in case the ESXi client is temporarily unavailable (e.g., during ESXi upgrade, host reboot, etc.). At first glance, everything should be “green” in the dashboard.
Under System, all components should be “green”.
Under Storage, all SSDs and disks should show up.
No alarm should appear in the system events.

Also don’t forget to check the health of services or solutions like vSAN iSCSI Target Services, vSAN File Services or HCI Mesh Remote Datastores if you are using them in your environment.

Upgrade Process

// SolVe Procedure

For most activities in a VxRail environment (installations, upgrades, extensions, replacements or even configuration adjustments), there are officially supported procedures.

The procedures contain, among other things, the selected options when creating the respective procedure, important knowledge base articles about the selected procedure that you should review beforehand, recommended materials that you will need in the process and recommended activities that you should perform before – such as taking snapshots of the management systems and using the VxVerify script.

I’d recommend creating a new, dedicated procedure for each upgrade process at the beginning of the upgrade planning!

The procedure can be created at SolVe Online:
https://solveonline.emc.com/solve/home/51

The procedures are staggered according to permissions. Depending on the account and its permission level, a corresponding selection appears after login.

There are procedures
– that only the manufacturer is allowed to perform.
– that the manufacturer and a certified partner (such as SVA) may perform.
– that everyone, including every customer, is allowed to perform themselves.

If you do not find a procedure or are not sure whether you need one at all, please contact support.

Procedures can be created through the SolVe website (after logging in with your Dell account):
– Navigate to https://solveonline.emc.com/solve/home
– Click on the “All Products” tab and then on the “VxRail Appliance” tile.

In the “VxRail Appliance” Overview, you can expand the respective categories and select the desired sub-category. By clicking on the link “Software Upgrade Procedures”, you will then be guided through a dialog with the corresponding available options.

At the end, a PDF with the corresponding procedure will be generated for immediate download and you will also receive an email to the Dell account address with a temporary link to download the procedure.

// VxVerify Script

VxVerify is a Python script that performs a comprehensive analysis of the state of a VxRail cluster. It is highly recommended to run VxVerify before updates, upgrades, extensions or for general maintenance. When VxVerify is run on VxRail Manager, it sends so-called “minions” (small Python programs) to each VxRail node in the cluster. These minions, in turn, perform checks on each node. In addition to ESXi host-specific tests, VxVerify also performs checks at the VxRail Manager, VMs, vCenter and cluster levels.

VxVerify can detect many potential known issues in advance, helping to identify potential showstoppers early enough to get them out of the way in time for the upgrade. The script can be started manually at any time and multiple times. It is also included in the Upgrade Pre-Check.

I’d recommend running VxVerify for the first time one to two weeks before the planned upgrade and repeating it as often as necessary until all reported warnings and alarms are resolved or can be ignored.

The latest script collection can be downloaded as a zip file from the following page:
https://www.dell.com/support/kbdoc/en-us/000021527/vxrail-how-to-run-vxverify

The unzipped folder must be copied to the VxRail Manager (e.g., via WinSCP). On the VxRail Manager the script can be executed as the root user:

# ./vxverify.sh

Now you can choose what you want to do. For our use case, we choose Upgrade check and then enter the version to which we want to upgrade. After that, we can enter the vCenter root and vSphere SSO Administrator credentials to use the additional vCenter checks.

The VxVerify result, which is also saved in a text file for later use, contains corresponding warnings (yellow) or alarms (red). Alarms must be fixed before the upgrade! Warnings should be checked individually to determine whether they are really a “problem” for the upgrade and must be fixed, or whether they are, e.g., only a hint to check a component (like vCenter Server, Witness Appliance, NSX, etc.).

The resulting Dell KB article numbers can be looked up here:
https://www.dell.com/support/contents/en-us/category/product-support/self-support-knowledgebase

Recent versions of VxVerify can restart stale sandboxes on the VxRail Manager and also the iDRACs of the VxRail nodes! Because there have been some issues with iDRAC version 5.x during upgrades and also during daily operations, I would recommend always restarting all iDRACs just before starting the VxRail upgrade, even if VxVerify has already done it.

// VxRail Bundle

The VxRail Cluster can be upgraded online (Internet Upgrade) or by bundle (Local Upgrade).

For VxRail clusters with an “internal” vCenter that is included in the VxRail Lifecycle, a Composite Upgrade Package (not the Slim one!) with the desired target version must first be downloaded from the following Dell page:

https://www.dell.com/support/home/de-de/product-support/product/vxrail-software/drivers

  • VXRAIL_COMPOSITE-<Target-Version-Build>_for_7.0.x.zip

// Pre-Check

After downloading or uploading the bundle to the VxRail Manager, a Pre-Check can be performed before the upgrade. This pre-check also contains the VxVerify script, but additionally checks further points in the vSphere/vSAN environment that cannot be checked by VxVerify.

I’d recommend running the Pre-Check at least one day before the upgrade and fixing or clarifying all reported warnings and alarms.

// Upgrade

If there are no more open warnings or alarms in all the health checks or the Pre-Check (incl. VxVerify) and the backups of the management systems have been made, the VxRail upgrade can be started.

For most upgrades you can expect a duration of about 0.5h per management system and 1h per node. This means you should start correspondingly early if you want to upgrade a larger cluster. Starting with the latest 7.0.24x versions, you can pause an upgrade (for example after one Stretched Cluster site), although I haven’t tried this yet.
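
The rule of thumb above can be written down as a tiny Python helper for planning the maintenance window. The function name is my own, and the default values are simply the rough figures from this guide, not guaranteed durations.

```python
def estimate_upgrade_hours(management_systems, nodes,
                           hours_per_management_system=0.5,
                           hours_per_node=1.0):
    """Rough upgrade duration in hours, based on the rule of thumb."""
    return (management_systems * hours_per_management_system
            + nodes * hours_per_node)
```

For a cluster with two management systems (VxRail Manager and vCenter) and eight nodes, `estimate_upgrade_hours(2, 8)` gives 9.0 hours. I would add some buffer on top for pre-checks, post-checks and unexpected waiting times.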

The upgrade sequence is usually as follows: first the VxRail Manager, then the vCenter and then all VxRail nodes one after the other, starting with the smallest serial number.

Sometimes it is not recognizable in the vSphere Client or in the VxRail Upgrade Menu whether the process is (still) doing anything at all, or what it is doing. Please stay calm and have a little patience.

A look into the upgrade log or the general Micro Services log of the VxRail Manager via SSH can provide clarity:
# tail -f /var/log/mystic/lcm-web.log
# tail -f /var/log/microservice_log/short.term.log

After the VxRail Manager upgrade (or even after a vCenter restart), the VxRail plugin in the vSphere Client (and thus the VxRail Upgrade menu) may be temporarily unavailable. It may take a moment for the plugin to upgrade (or load) and may require a browser refresh.

If the (internal) vCenter is included in the VxRail Upgrade the connection to the vSphere client is temporarily lost during the vCenter upgrade and restart. If possible, you should remember the node on which the vCenter VM is running beforehand and connect to this node with the ESXi Host client. During the vCenter upgrade you can monitor the upgrade and boot process in the VM Console. After starting the vCenter, you can check in the VAMI if all vCenter services are starting and log back into the vSphere Client once the service has started.

To be able to monitor the upgrade and boot processes of the VxRail nodes, you can connect to the iDRACs. During the iDRAC upgrade, the connection to the iDRAC is temporarily lost. When the iDRAC has been restarted (it can be monitored e.g., with the help of a continuous ping), you can log in again.
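
Instead of watching a continuous ping by hand, the waiting can be scripted. The following Python sketch polls a host until it answers again. The function names are mine, `ping_once` assumes the Linux `ping` syntax (`-c` for the count, `-W` for the timeout in seconds), and the sleep and clock functions are parameters only so that the loop can be tried out without actually waiting.

```python
import subprocess
import time


def ping_once(host):
    """Send a single ICMP echo request using the system ping (Linux syntax)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_reachable(host, probe=ping_once, timeout=900, interval=5,
                         sleep=time.sleep, clock=time.monotonic):
    """Poll probe(host) until it succeeds.

    Returns the seconds waited, or None if timeout seconds have passed.
    """
    start = clock()
    while True:
        if probe(host):
            return clock() - start
        if clock() - start >= timeout:
            return None
        sleep(interval)
```

A call like `wait_until_reachable("idrac-node01.example.local")` (hypothetical name) returns as soon as the iDRAC answers a ping again. Note that a ping reply does not yet mean that the iDRAC web interface is ready for login.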

In certain cases, for example if you have not upgraded the vSAN Witness Appliance first and have not chosen the VxRail option to do it for you automatically, you must now upgrade it manually. After that, another manual task will probably be needed, namely the vSAN Disk Format Upgrade.

Finally, if you use VMware vRealize Log Insight as your syslog server (internal vCenter only) you should not forget to manually upgrade it too.

// Post-Checks

If the upgrade has run successfully and all subsequent health checks are good, the reader’s systems and their applications or services running on the VxRail cluster and all physical servers accessing a service of the VxRail cluster (such as vSAN iSCSI Target Service, File Services or Remote Datastores) should also be gradually checked for functionality.

Also, please check immediately whether the (3rd party) vSphere- and vCenter-integrated solutions (see “(3rd party) Solution Compatibility & Interoperability”) still work as they should.

Support

Increased attention and careful checking of the monitoring solution and the backup systems are essential after the upgrade. If unexpected problems occur that cannot be solved by yourself, various support options are available. 

// Dell Support

For hardware and software support as well as maintenance cases, the Dell Customer hotline is available around the clock.

A Dell Service Request can alternatively be opened directly via the VxRail menus in the vSphere Client or online via the Dell Support Portal (registration required):
https://www.dell.com/support/home/en-us

The following information must be provided to Dell for a Service Request:

  • customer information:
    Dell EMC Site ID (customer number at Dell EMC), company name, address where the product is installed, name, phone number and email address.
  • product information:
    For Dell hardware platforms: Serial Number and Product Name
    For Dell software: Host ID or product name and product version
  • problem description:
    Error description, error message, logs, etc.

// Some Helpful Links

Dell Support Knowledge Base
https://www.dell.com/support/home/en-de?app=knowledgebase

Dell VxRail Event Code Reference
https://dl.dell.com/content/docu91469

VxRail 7.0 useful log file information – by David Ring
https://davidring.ie/2021/01/14/vxrail-7-0-useful-log-file-information/

If you have read up to this point, I hope my article was helpful to you. Feel free to share if you like…


// footnotes:

Date: 04.04.2023
Version: 1.2
