Server Upgrades

 

Introduction.  The purpose of any server upgrade is to get improved services up and running.  Limoncelli outlines the process as follows:
  1. Develop a service checklist:
    1. What services are provided by the server?
    2. Who are the customers of each service?
    3. What software package provides which service?
  2. Verify that each software package will work with the new OS or plan an upgrade path.
  3. For each service, develop a test to verify that it is working.
  4. Write a backout plan.
  5. Select a maintenance window.
  6. Announce the upgrade as appropriate.
  7. Execute the tests developed earlier to make sure they are correct.
  8. Do the upgrade with someone watching/helping (mentoring).
  9. Repeat all the tests developed earlier.  Follow the usual debugging process.
  10. If all else fails, rely on the backout plan.
  11. Communicate completion/backout to the customers.

Step 1:  Develop a Service Checklist.  The service checklist is a tool that you use to drive the entire process.  The list should record

  • what services are provided by the host
  • who are the customers of each service
  • which software package provides each service

Spreadsheets are an excellent way to maintain such information since all computer users and most business people are familiar with them.  It is likely to be best to provide access to the file from the web, which can be done in a number of ways.  If you really need to double check the plans then hold a meeting.  E-mail is also an obvious choice for keeping people posted.

Including the end users in the planning process gives them a feeling of participation and control.  It also makes them more likely to invest in the eventual outcome.

Usually each service is directly related to a single software package.  Sometimes a service is related to multiple packages, such as a calendar server that relies on an LDAP (Lightweight Directory Access Protocol) server.  These sorts of dependencies need to be documented.

It is also important to document the users of a service.  If they are people then they should be included in the process and represented.  It can also be important to document the users to ensure that the service is still required.  While it is likely to be extremely rare, there are going to be times where the service can be eliminated.
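A spreadsheet is the natural home for the checklist, but the same information can be sketched as a small data structure.  The following is only a minimal Python illustration, not a prescribed format; the field names, the helper functions, and the example services are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    """One row of the service checklist."""
    name: str
    customers: List[str]
    packages: List[str]
    # Names of other services this one relies on (hypothetical field).
    depends_on: List[str] = field(default_factory=list)


def unowned_services(checklist):
    """Names of services with no recorded customers (candidates for review)."""
    return [s.name for s in checklist if not s.customers]


def missing_dependencies(checklist):
    """(service, dependency) pairs whose dependency is not itself on the checklist."""
    known = {s.name for s in checklist}
    return [(s.name, dep) for s in checklist for dep in s.depends_on if dep not in known]


# Hypothetical example data, not taken from any real site.
checklist = [
    Service("calendar", ["sales team"], ["example-calendar-server"], depends_on=["ldap"]),
    Service("ldap", ["calendar service"], ["example-ldap-server"]),
    Service("legacy-ftp", [], ["example-ftpd"]),
]

print(unowned_services(checklist))
print(missing_dependencies(checklist))
```

A service that shows up with no customers is not automatically safe to remove, but it is a good prompt to ask whether it is still required.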

Step 2:  Verify Software Compatibility.  Next, verify that each software package and service will work with the operating system release being installed.  The optimistic place to start when determining compatibility is with the vendor.  But, as we all know, vendors are in the business of making sales and usually blame the other firm if something ends up being incompatible.  It can also be important to talk to other customers of the vendor who have similar configurations.  If worse comes to worst, you can test things at your own site.

If you find out that particular combinations of software and operating systems are incompatible, then you may need to do one of the following:

  • Upgrade to a release that supports both the old and new operating systems.
  • Upgrade an operating system to enable particular software to work.
  • Deal with the incompatibility in some other, more special-case way.

Step 3:  Verification Tests.  As each service is identified, a test should be developed that will be used to verify that everything is working together properly.  It is almost always best to have prewritten scripts to perform the tests.  One advantage is that the tests can then run largely unattended.  In addition, a master script can be written that runs all of the smaller scripts and reports any failures as error messages.

But it may also be important to run these scripts individually to investigate particular difficulties.  Rerunning the same set of tests after a change, to confirm that what worked before still works, is often called regression testing.
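A master script of this kind might look something like the sketch below.  It is written in Python and assumes that each test is an external command that exits with status 0 on success; the two test commands shown are hypothetical stand-ins for real service checks.

```python
import subprocess
import sys


def run_test(command, timeout=60):
    """Run one test command; it passes if it exits with status 0 in time."""
    try:
        result = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return False, str(exc)
    return result.returncode == 0, result.stderr.strip()


def run_all(tests):
    """Run every test and return {name: (passed, detail)}."""
    return {name: run_test(command) for name, command in tests.items()}


def failed(results):
    """Sorted names of the tests that did not pass."""
    return sorted(name for name, (passed, _) in results.items() if not passed)


# Hypothetical tests; real ones would exercise mail, printing, DNS, and so on.
TESTS = {
    "hello": [sys.executable, "-c", "print('Hello World')"],
    "always-fails": [sys.executable, "-c", "import sys; sys.exit('service unreachable')"],
}

results = run_all(TESTS)
for name, (passed, detail) in results.items():
    print(("PASS" if passed else "FAIL"), name, detail)
```

Because each test is a separate command, any one of them can still be run by hand when a particular service needs closer attention.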

Sometimes these tests can be as simple as a classic "Hello World" example or printing some test pages around the network.  But in general, the tests are going to have to be set up appropriately for what is being tested.  Scripted tests can also be much more difficult, if not impossible, to develop for more elaborate software that has no command-line interface.  Then the sys admin has to do a lot more work by hand.

Step 4:  Write a Backout Plan.  While this will be covered in much more detail later, if something goes wrong you need to make certain you can at least back out to what was working previously.  If something small goes wrong then the usual debugging process will hopefully fix it. 

On the other hand, you can use up the entire maintenance window trying "just one more thing" to make an upgrade work.  It is important to have a predefined time and/or other criteria for when the backout plan should be activated.  When the criteria for backout are met, you have to back out in order to at least get back to something functioning.
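One simple time-based criterion is to work backward from the end of the maintenance window.  The Python sketch below assumes you have estimates for how long the backout and the follow-up retesting will take; the function names and the example times are hypothetical.

```python
from datetime import datetime, timedelta


def backout_deadline(window_end, backout_time, retest_time):
    """Latest moment a backout can start and still finish, with retesting, inside the window."""
    return window_end - backout_time - retest_time


def must_back_out(now, window_end, backout_time, retest_time):
    """True once the deadline has arrived and the upgrade is still not verified."""
    return now >= backout_deadline(window_end, backout_time, retest_time)


# Hypothetical window ending at 06:00, with 90 minutes to back out and 30 to retest.
window_end = datetime(2024, 1, 6, 6, 0)
deadline = backout_deadline(window_end, timedelta(minutes=90), timedelta(minutes=30))
print(deadline)
```

Writing the deadline down before the upgrade starts makes it much harder to talk yourself into "just one more thing" at 4 a.m.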

Some systems, if they are small enough, can be backed up completely before attempting an upgrade.  It can be even easier to clone the disks and perform the upgrade on the clones.  Then, if problems arise that cannot be solved in the allotted time, the original disks can be reinstalled.  Unfortunately, larger systems are typically much more difficult to replicate.  Replicating the system disks and doing incremental backups of the data disks may be sufficient in some cases.

Step 5:  Select a Maintenance Window.  It is important to come up with some sort of agreement as to a maintenance window.  But, in order to do this you need to be prepared for the upgrade and be able to make decisions based on relatively reliable information.

End users usually have a pretty good idea when they can withstand an outage.  Most organizations' systems are not needed at night or over a weekend.  Unfortunately, sys admins may not want to work these hours.  It can also be the case that the organization has systems that are required to be operating 24 x 7.  This can make selecting a maintenance window even more difficult.

The duration of a maintenance window equals

  • the time the upgrade should take
  • plus the time testing should take
  • plus the time it will take to fix problems
  • plus the time it may take to execute a backout plan

In many instances it may be important to double or triple your estimates to correct for your own optimism.  It is also important to remember that you may get started late or run into unforeseen difficulties such as technical issues, weather, or car problems.
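The arithmetic is simple enough to sketch in a few lines of Python.  The function name, the default safety factor of 2, and the example estimates are assumptions chosen for illustration.

```python
from datetime import timedelta


def window_duration(upgrade, testing, fixing, backout, safety_factor=2.0):
    """Sum the four estimates, then pad the total by the safety factor."""
    return (upgrade + testing + fixing + backout) * safety_factor


# Hypothetical estimates: 60 + 30 + 45 + 45 minutes = 3 hours before padding.
estimate = window_duration(
    upgrade=timedelta(minutes=60),
    testing=timedelta(minutes=30),
    fixing=timedelta(minutes=45),
    backout=timedelta(minutes=45),
)
print(estimate)  # 6:00:00
```

With a safety factor of 3 the same estimates would call for a nine-hour window, which may decide whether the work fits in a single night.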

Step 6:  Announce the Upgrade as Appropriate.  At this point, the upgrade needs to be announced to the end users.  Use the same format for all announcements so that end users can get used to them.  The likeliest ways to distribute these notices are e-mail, voicemail, paper memo, newsgroup posting, web page, note on the door, or smoke signals.  Keep the message brief and to the point.

One of the best ways to do this is to have a template that can be filled out.  This helps the process attain consistency.  This can also help to ensure that all the important or relevant information is included.
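As an illustration, a template could be as simple as the Python sketch below, which uses the standard library's string.Template.  The field names and the layout of the notice are hypothetical; use whatever your users already recognize.

```python
from string import Template

# Hypothetical layout; every $name is a field that must be filled in.
ANNOUNCEMENT = Template(
    "Subject: Scheduled maintenance: $server\n"
    "\n"
    "What:    $summary\n"
    "When:    $start to $end\n"
    "Impact:  $impact\n"
    "Contact: $contact\n"
)


def render_announcement(**fields):
    """Fill in the template; substitute() raises KeyError if a field is missing."""
    return ANNOUNCEMENT.substitute(**fields)


# Hypothetical example values.
print(render_announcement(
    server="fileserver1",
    summary="Operating system upgrade",
    start="Sat 22:00",
    end="Sun 04:00",
    impact="File shares and printing unavailable",
    contact="helpdesk, ext. 0000",
))
```

Using substitute() rather than safe_substitute() is deliberate here: an announcement with a missing field fails loudly instead of going out incomplete.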

Step 7:  Test the Current Setup.  Right before the upgrade begins, run the tests against the existing system.  This confirms that the tests themselves are correct and records a known-good baseline to return to if you need to back out.

Step 8:  Do the Upgrade.  Many upgrades are too critical to be done solo.  Make sure you have backup in the form of extra pairs of hands.  These sorts of upgrades are also a very important opportunity to have some of your more experienced sys admins interacting with the less experienced.  This can be fruitful in a large number of ways.  One of them is that if the upgrade doesn't go well, it takes much less time to escalate it to someone with more knowledge and experience.

Step 9:  Test the Upgrade.  Now all of the tests that were run in Step 7 need to be rerun.  In addition, the upgrade is likely to require some additional tests.  Many of these tests may also need to be repeated as the debugging process continues.
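Comparing the results from before and after the upgrade shows which failures the upgrade actually caused.  The Python sketch below assumes each run is recorded as a mapping from test name to pass/fail; the function names and sample results are hypothetical.

```python
def regressions(before, after):
    """Tests that passed before the upgrade but fail, or are missing, afterwards."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def preexisting_failures(before):
    """Tests that were already failing before the upgrade started."""
    return sorted(name for name, passed in before.items() if not passed)


# Hypothetical results recorded before and after the upgrade.
before = {"mail": True, "print": True, "calendar": False}
after = {"mail": True, "print": False, "calendar": False}

print(regressions(before, after))
print(preexisting_failures(before))
```

Separating the two lists keeps the debugging effort focused: a test that was already failing beforehand is not evidence against the upgrade.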

End users should also be involved in this step if reasonable.  This is likely to occur at some prearranged time and location.

Step 10:  If Necessary, Execute the Backout Plan.  If certain predetermined criteria aren't met at certain times then you need to execute the backout plan.  This may happen even if the upgrade is quite far along.

You may choose not to back out of every aspect of the upgrade if some of the changes are compatible with the pre-upgrade settings.  But it is important to have such issues completely clarified in advance rather than diving into last-minute decisions.

After the backout, the services need to be tested one more time.

Step 11:  Communicate Completion/Backout.  This is the step where the end users are notified about the extent of the upgrade or return to previous settings.

Just as there are plenty of approaches to announcing the maintenance window, the same approaches apply to communicating the results of the attempted upgrade.

The Icing.  As always, there are some other possibilities that can be considered icing on the cake rather than essential.  Which items belong in this category is always open for debate.

  • Add/Remove Services at the Same Time
    • sometimes a necessity
    • sometimes a luxury if it can be done
  • Fresh Installs
    • sometimes much better
      • system not so kluged up
  • Reusing the Tests
    • might be useful to integrate these tests into some sort of real time or periodic monitoring system
  • System Change Log
    • important to maintain a log of what has been changed on the server
  • Dress Rehearsal
    • perform a dress rehearsal on another machine that has the same configuration in the relevant aspects
    • sometimes you might even do this and just swap machines if they are adequately similar and the upgrade is significant enough
  • Install Old and New Versions
  • Minimal Changes from the Base
    • sometimes it is best to put particular software in other partitions