Forget Looking Back – Clone and Update Forward
Updates are inevitable. With good planning and preparation, updates go smoothly. Without due care, updates can be fraught with challenges. Some simply install without consequence; others uncover long-undetected defects in applications and underlying software layers.
The existing active system disk is commonly updated in place. However, storage availability and costs have changed dramatically. Cloning the existing system disk and updating the clone is now a far better option. If the update goes awry, or problems are detected later, the original system volume remains available as a fallback.
Planning for updates requires preparing for the worst and hoping for the best. If the update proceeds without incident, all the better. Preparation and planning are never wasted. We should always consider whether techniques from earlier technological eras should be re-evaluated in the context of present and projected technologies.
Clone and update is not without precedent. Though not described in as many words, it is effectively the technique used for updating a shared system volume in an OpenVMScluster while other OpenVMScluster nodes continue normal operations.[1] The Installation and Upgrade manual refers to this approach as a "rolling upgrade," where a copy of the system volume is updated standalone on one system then used to reboot other OpenVMScluster members in sequence.
Classic OpenVMS upgrades begin with a system disk backup, followed by the update installation. The system is rebooted following successful update installation. If the update was installed on a shared system volume in an OpenVMScluster, a rolling reboot must be performed immediately.[2] If the update installation does not complete successfully, or if there are other problems, restore from the backup and restart the system. There is no quick fallback.
The rolling reboot is mandatory when updating a shared system volume in an OpenVMScluster. Files shared by multiple systems have been updated; mixed pre/post update files adversely impact system stability.[3] The time pressure of a rolling reboot can lead to problems.
We should reconsider the reboot question from a different perspective. Since we are updating a clone of the original system volume, we can repurpose the approach used for an OpenVMS upgrade in the update context. Rather than the image of a rolling wheel, visualize changing a tread on a tracked vehicle, hence a "tread reboot": rebooting one node at a time from the updated system volume over a longer interval. During a tread reboot, BOTH the original and updated system volumes are always usable. There is no danger that an external event, e.g., a power interruption or hardware glitch, will result in an uncontrolled adoption of an update. Tread reboots remove the time constraints imposed by a rolling reboot using an updated-in-place volume.
There is never an interruption in operations if an update fails. No restore is necessary; the original system volume remains available for use on a moment’s notice. A server can revert to the previous system by a simple reboot.
The instructions for the VMSINSTAL.COM command procedure and the PRODUCT INSTALL command describe the process as:
After the update has been installed, experience with the updated system may uncover additional problems.
Qualifying an updated system for production use is often an extended process. Qualification does not guarantee that problems will not emerge later. Some applications and underlying packages may operate correctly, others may not. The probability that a problem will be discovered declines over time. The probability of problems is not the same for all workload components.
Mixed-version and mixed-architecture OpenVMSclusters are common. Mixed-version OpenVMSclusters offer flexibility and leverage. Clone and update leverages mixed-version OpenVMSclusters to reduce risk and downtime.
If your OpenVMS system is not part of an OpenVMScluster, or is the only Alpha processor (or, in the future, the only x86-64 system) in the OpenVMScluster, an emulator or virtual machine can be used to do the work of the update, with the updated system volume then transferred to the storage array for production use.
The sequence of backup, update, and possible restore has serious drawbacks. Restoring after a failed update takes time and is not without hazard. Overwriting the partially or fully updated system image destroys both the updated disk and the evidence of the problem, and condemns the system manager to redo the same work at a future time.
The update procedures were written long ago, in a far different computing and storage landscape. We should always reconsider our update practices in the present landscape. The present landscape makes clone-update a highly attractive and cost-effective alternative.
Software environments are not limited to operating systems. Most sites have a portfolio, sometimes vast, of layered products. Each product may need to be requalified after an update. This reality means that any "qualification" is inherently conditional and incremental.
We start with the upgrade sequence itself and examine how clone and update reduces costs, delays, and risks. Our goal is an update process with less risk, together with a faster and safer regression strategy. These concerns apply whether problems occur during the update itself or are detected later.
Traditionally, the primary precaution against problems is a backup of the system volume taken before installing the update. When mass storage was expensive, e.g., the 1980s, there was little choice. Disk drives were expensive, extra drives an unimaginable luxury.
Interruptions in system availability caused by a restore operation have serious consequences when systems operate 24x7x366.
The backup-update-restore sequence dates from that earlier technological era.
Multi-terabyte (TB) disk drives now weigh less than a pound, fit easily in a hand, and often cost well under US$1,000 each, a cost/bit reduction of approximately 175,000x. It is now unusual to allocate an entire physical disk volume as a "system" volume. Rather, multi-terabyte physical volumes are connected to storage arrays or controllers which carve physical space pools into far smaller logical volumes. A space pool may be a single physical disk drive, or multiple physical drives combined into a striped (RAID0), mirrored (RAID1), n+1 redundant (RAID5), or other arrangement.
OpenVMS system volumes (VAX, Alpha, IA64, or x86-64) comfortably fit on a 3GB logical volume.
The contemporary mass storage environment and the OpenVMS logical name facility are symbiotic. Cloning a 5GB logical volume takes mere minutes at present mass storage speeds. Cloning the system volume followed by updating the clone produces an updated system volume. The original system volume is unchanged and stands ready as a fallback.
The SYS$SYSDEVICE-rooted system logical names disconnect the system volume location from its role as a system volume. A good guideline is to never use any name other than the SYS$SYSDEVICE-rooted names when referencing the system volume. The SYS$SYSDEVICE and related names are defined in VMS$INITIAL-050_VMS.COM based upon the bootstrap device.[4]
Following that rule, there is no operating difference between $1$DGA1, $1$DGA786, DKB500, or any other volume. System volumes are all fungible.
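The fungibility of system volumes can be observed from DCL. The following is a minimal sketch, assuming an interactive session on a running system; it uses only standard logical names and lexical functions, and the output will differ from site to site.

```
$! Show where the system volume logical names currently point.
$ SHOW LOGICAL SYS$SYSDEVICE
$ SHOW LOGICAL SYS$SYSROOT
$! Refer to the volume by logical name, never by physical device name.
$ DIRECTORY SYS$SYSDEVICE:[000000]
$! Report the label of whichever volume the system was bootstrapped from.
$ WRITE SYS$OUTPUT F$GETDVI("SYS$SYSDEVICE","VOLNAM")
```

Because none of these commands name a physical device, they behave the same whether the node was bootstrapped from the original volume or from a clone.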
Reboot one server from the clone following the instructions in the relevant Release Notes. One may use a conversational bootstrap with STARTUP_P1 set to "MIN".[5] Install the update on the cloned system disk. The original system volume remains pristine. If running an OpenVMScluster, normal operations can continue uninterrupted on the other nodes in the OpenVMScluster. If running a single OpenVMS system, retreating to the pre-update system is a quick shutdown and a reboot, a matter of minutes. Modify SYS$MANAGER:SYLOGICALS.COM on the cloned system disk to point at the original cluster-wide system data files, e.g., SYSUAF.DAT, RIGHTSLIST.DAT.[6] Note that at this point a single server is running from the updated system volume with no access to cluster-accessible mass storage.
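As an illustration of the SYLOGICALS.COM change, the fragment below is a sketch under stated assumptions: the original system volume is $1$DGA1 with the label I64VMS842L3 (both borrowed from the example later in this article), and the shared files reside in their default [VMS$COMMON.SYSEXE] directory. Sites that have relocated these files, or that share additional files, will need different definitions.

```
$! Assumed fragment for SYS$MANAGER:SYLOGICALS.COM on the CLONED system disk.
$! Device name, volume label, and directory are assumptions; adjust per site.
$ MOUNT/SYSTEM/NOASSIST $1$DGA1: I64VMS842L3
$! Point the authorization files at the copies on the original system volume.
$ DEFINE/SYSTEM/EXECUTIVE_MODE SYSUAF     $1$DGA1:[VMS$COMMON.SYSEXE]SYSUAF.DAT
$ DEFINE/SYSTEM/EXECUTIVE_MODE RIGHTSLIST $1$DGA1:[VMS$COMMON.SYSEXE]RIGHTSLIST.DAT
```

Defining the logical names in executive mode at the system level means every process on the updated node sees the same authorization data as the nodes still running from the original volume.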
After the update has been installed, verify that the updated system volume is undamaged:
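One way to carry out such a check, offered here as an assumption and not necessarily the procedure the author had in mind, is the ANALYZE/DISK_STRUCTURE utility. Run without /REPAIR, it reports file structure inconsistencies without modifying the volume.

```
$! Check the file structure of the volume this node booted from (the clone).
$! No /REPAIR qualifier is given, so the volume is examined but not changed.
$ ANALYZE/DISK_STRUCTURE SYS$SYSDEVICE:
```

On an active system disk, some reported discrepancies may simply reflect files that are open at the time, so the output needs to be read with that in mind.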
The OpenVMScluster is now operating as a mixed-version or mixed-update OpenVMScluster, with most of the nodes running the pre-update OpenVMS from the original system volume and a single, possibly virtualized, node running the updated version of OpenVMS from the updated cloned system volume.
The updated OpenVMS system volume can be validated at leisure. As confidence in the update increases, additional servers can be bootstrapped from the updated system volume. When the updated system volume is fully qualified, all remaining members of the OpenVMScluster transition to the updated system volume a small number at a time. The original system volume is still used as a quorum disk and the repository of cluster-wide shared files, as well as a fallback should it be needed.
The cautionary note in the OpenVMS Update Release Notes that one should cycle through a rolling reboot does not apply.[7] The cautionary note relates to updating an active shared system volume. With clone-update, the pre-update system volume remains unchanged. Only one node was using the clone when it was being updated, and it was rebooted following the update.
This is the same changeover as a rolling upgrade, without the time pressure. Unlike a "big bang" cutover, if there is a question, time can be taken to clarify the issue before proceeding. In larger 24x7x366 environments, the less pressing time scale enables significant flexibility and reduces risks.
If a problem does occur, reverting to the previous version is simplicity itself. Shut down the affected system(s) and bootstrap from the un-updated system volume.
Reviewing an actual example illustrates the benefits.
Consider a mid-size OpenVMScluster with a modest, less than 10 TB storage array.
The storage array provisions logical volumes:
The first step is to clone $1$DGA1 onto $1$DGA2 using BACKUP/IMAGE. In this context, one is not worried about log files which are opened for write.
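A minimal sketch of the cloning step, assuming $1$DGA2 is an unused logical volume at least as large as $1$DGA1 and that the commands are issued from a suitably privileged account. The /IGNORE=INTERLOCK qualifier reflects the remark above about files that are open for write; whether that is acceptable is a site decision.

```
$! The output volume of a disk-to-disk image copy must be mounted /FOREIGN.
$ MOUNT/FOREIGN $1$DGA2:
$! Copy the entire system volume; files open for write are copied as found.
$ BACKUP/IMAGE/IGNORE=INTERLOCK $1$DGA1: $1$DGA2:
$ DISMOUNT $1$DGA2:
```

An image copy reproduces the volume label along with everything else, so immediately after this step the clone carries the same label as the original.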
Mount $1$DGA2 privately and reset the volume label.
    $ MOUNT $1$DGA2 I64VMS842L3
    $ SET VOLUME $1$DGA2:/NAME=I64VMS842L3B
All node-specific system root directories, e.g., [SYS0], [SYS1], …, [SYSn] will be present on both $1$DGA1 and $1$DGA2. OpenVMS system parameter file(s) are properly set.
All references to the system device should use one of the SYS$SYSDEVICE-based logical names, so it is a simple matter to bootstrap a single server using STARTUP_P1 set to "MIN" from the clone, e.g., $1$DGA2 in the example.[8] Following the MIN bootstrap, the server will be running as a member of the OpenVMScluster. The update can now be installed.
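The conversational bootstrap can be sketched as follows. How the SYSBOOT> prompt is reached depends on the platform console and is deliberately not shown; this is an illustrative sketch of the two commands involved, not a transcript from a real session.

```
SYSBOOT> SET STARTUP_P1 "MIN"
SYSBOOT> CONTINUE
```

STARTUP_P1 should be reset to "" once the minimum startup is no longer wanted; otherwise, subsequent bootstraps from this volume would also be minimal.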
Once the update has been installed, examine MODPARAMS.DAT. In some cases, OpenVMS updates/upgrades may modify MODPARAMS.DAT, necessitating a re-editing of MODPARAMS.DAT. After verifying MODPARAMS.DAT, run AUTOGEN and reboot from the updated/upgraded system volume, initially using STARTUP MIN, and later with the standard bootstrap. Verify that VAXCLUSTER is set correctly.
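These checks can be sketched in DCL as below. The sketch makes several assumptions: that the update left a prior version of MODPARAMS.DAT to compare against, that NOFEEDBACK is appropriate because a node under minimum startup has not accumulated representative feedback data, and that the manager prefers to review AUTOGEN's report before rebooting manually.

```
$! Compare the newest MODPARAMS.DAT with the previous version, if one exists.
$ DIFFERENCES SYS$SYSTEM:MODPARAMS.DAT
$! Recalculate and set parameters without rebooting automatically.
$ @SYS$UPDATE:AUTOGEN GETDATA SETPARAMS NOFEEDBACK
$! Review AUTOGEN's report before rebooting.
$ TYPE SYS$SYSTEM:AGEN$PARAMS.REPORT
$! Display the VAXCLUSTER parameter from the current parameter file.
$ RUN SYS$SYSTEM:SYSGEN
USE CURRENT
SHOW VAXCLUSTER
EXIT
```

Stopping AUTOGEN at SETPARAMS, instead of letting it run through REBOOT, leaves the moment of the reboot under the system manager's control.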
When the new system volume has been validated, a tread reboot switches other members of the OpenVMScluster to the new system volume. The benefit of a tread reboot is that it eliminates the time pressure of a rolling reboot.
Done properly, this procedure limits downtime to a matter of minutes per server: the time required to shut down and reboot each node. If nodes are set up appropriately, with at least two members providing each service, near 100% application uptime is achievable.
Working with standalone OpenVMS systems incurs more downtime, but clone-update results in less downtime risk than the classic backup-update-restore sequence. If no second processor is available, downtime can be significantly reduced by employing an Alpha emulator (or in the future, an x86-64 virtual machine) to perform the actual update, either directly using a disk on the SAN or by transferring the disk to the emulated/virtual machine environment and back. If a fallback to the original system volume becomes necessary, the system can always be restarted from the pre-update system volume. This "tread reboot" is a gradually inclined ramp; fast rolling reboots are effectively a cliff to be scaled, where a detected problem can severely impact normal operations.
Controlling instantaneous risk through eliminating cold cutovers reduces the potential for significant problems.
| [1] | VMS Software (2017, June) VSI OpenVMS Alpha Version 8.4-2L2 Installation and Upgrade Manual Section 5.5.2 Rolling Upgrade | 
| [2] | VMS Software (2018) VMS842L1_UPDATE-V0100 ECO Kit Release Notes, Section 2.2 Reboot Requirement | 
| [3] | ibid | 
| [4] | Gezelter, R (2020, December 3) OpenVMS STARTUP: Underappreciated Flexibility; [VMS$COMMON.SYS$STARTUP]VMS$INITIAL-050_VMS.COM | 
| [5] | VSI (2020, June) VSI OpenVMS System Manager's Manual, Volume 1: Essentials, Section 4.5.3. Booting with Minimum Startup | 
| [6] | [SYSMGR]SYLOGICALS.COM | 
| [7] | VMS Software (2018) VMS842L1_UPDATE-V0100 ECO Kit Release Notes, Section 2.2 Reboot Requirement | 
| [8] | [VMS$COMMON.SYS$STARTUP]VMS$INITIAL-050_VMS.COM | 
| Long: | http://www.rlgsc.com/blog/openvms-consultant/forget-looking-back-clone-and-update-forward.html | 
| Short: | http://rlgsc.com/r/20230109.html | 

