Maintenance Windows

 

Introduction.  What happens if you need a fairly long period of time for maintenance?  In truth, even many smaller maintenance and upgrade operations need a maintenance window.  A maintenance window is a short, scheduled period of time during which a lot of systems work must be performed.

Limoncelli proposes that maintenance activities can be partitioned into three major stages, with smaller steps within each stage.  These stages and steps are presented in the following table.

 

Stage        Activity
Preparation  Schedule the window
             Pick a flight director
             Prepare change proposals
             Build a master plan
Execution    Disable access
             Shutdown sequence
             Execute plan
             Perform testing
Resolution   Announce completion
             Enable access
             Have a visible presence
             Be prepared for problems

 

As the table above suggests, the major issues that are likely to need to be addressed include the role of the flight director, developing a coordinated plan, preparatory work, communicating with the end users and/or customers, organizing the operations of the window, and complete system testing.

The following outline provides a bit more discussion and fills in other issues associated with the above stages and steps.

  • Scheduling
    • periodic, as needed or both
    • must be worked out in conjunction with the affected parts of the organization
      • avoid product release dates
      • end of fiscal year dates
      • end of quarter dates
      • maybe other seasonal constraints such as summer or holiday impacts
  • Planning
    • individuals performing the maintenance need to have very significant input into the planning
    • advance planning allows
      • getting quotes
      • investigating alternatives
      • submission of purchase orders
      • arrival of needed hardware/software
    • lead times for these sorts of things can be quite significant
  • Flight Director
    • needs to develop announcements stating when the window will start and end and what will happen
    • needs to decide what work proposals will be implemented
    • needs to schedule work proposals
    • decides much about staffing
    • needs to monitor the progress
    • needs to ensure effective testing occurs
    • needs to be an experienced sys admin with considerable respect from others
    • needs to make judgment calls on things such as levels of risk and timing
    • needs a good overview of the site
    • should avoid actual technical work
  • Change Proposals
    • need to be submitted by a deadline that is not so far in advance that they lose relevance, but far enough in advance that they can be implemented
    • What changes are to be made?
    • What machines will require work?
    • What are the pre-maintenance window dependencies and due dates?
    • What needs to be up for the changes to happen?
    • Who will be affected by the changes?
    • Who will perform the work?
    • How long will the changes take?
    • How much additional help is required?
    • What are the test procedures?
    • What equipment is required?
    • What is the backout procedure and how long will it take?
  • The Master Plan
    • Once proposals are frozen the master plan can be determined
    • What should go forward, what shouldn't?
    • Need to determine dependencies and durations
      • what needs to precede what?
    • Need charts and tables to specify each person's part in the plan
    • Needs to be some slack in the schedule in order to adjust to surprises and things that go wrong
  • Disabling Access
    • What sorts of access need to be disabled or discouraged?
    • Place notices in appropriate places.
    • Might make announcements through phone or public address systems.
    • Change helpdesk messages appropriately.
    • Might require disabling all remote access.
  • Mechanics and Coordination
    • Very likely to have to consider dependencies for shutdown and boot sequences
      • critical servers
      • console servers
      • etc.
    • Console service
      • need console service when reasonable
      • coordinates access and authentication to multiple machines
      • limits interaction points
    • Radios
      • if site is spread out at all then need ways to coordinate workers
  • Deadlines for Change Completion
    • Flight director needs to monitor progress of individual tasks
    • Flight director needs to determine backout times and dependencies
  • Comprehensive System Testing
    • Sometimes choose to test many more systems than were directly involved
    • Need to reboot in appropriate order
    • Need to validate service functionality
  • Post Maintenance Communication
    • Flight director needs to notify everyone, including end users, of which services are restored and/or improved
  • Re-enable Remote Access
  • Visible Presence
    • Some sys admins are likely to need to work the help desk immediately following
    • Some sys admins are likely to need to make sure they are highly visible and accessible for interaction and feedback immediately following
  • Postmortem
    • Final assessment of success and what else needs to be worked on after all of this
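
The dependency questions raised under The Master Plan and Mechanics and Coordination can be made concrete with a small sketch.  The Python below derives a boot sequence and a shutdown sequence from a dependency list.  The service names and the DEPENDS_ON map are hypothetical, invented only for illustration, and the sketch assumes Python 3.9 or later for the standard graphlib module.  It is one possible approach, not a method taken from Limoncelli.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists the services that
# must already be running before it can start.
DEPENDS_ON = {
    "console-server": [],
    "dns": ["console-server"],
    "file-server": ["dns"],
    "database": ["dns", "file-server"],
    "web-app": ["database"],
}


def boot_order(depends_on):
    """Return services in an order where every dependency starts first."""
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in the reverse of boot order, so dependents stop first."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("Boot:    ", " -> ".join(boot_order(DEPENDS_ON)))
    print("Shutdown:", " -> ".join(shutdown_order(DEPENDS_ON)))
```

For this hypothetical map the boot sequence starts with console-server and ends with web-app, and the shutdown sequence is the reverse.  If the dependency list contains a cycle, static_order raises graphlib.CycleError, which is better discovered while building the master plan than during the window.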

The Icing.  Now we'll address some other issues associated with maintenance windows that may be considered to be extra.

  • Mentoring New Flight Directors
    • Need to select them far enough in advance to prepare them.
    • Trainee can do reasonable amount of the work with mentor.
  • Trending Historical Data
    • Analyze how long particular tasks take to improve estimates in the future.
    • Pass along knowledge through some objective measures.
  • Providing Limited Availability
    • Might involve taking advantage of redundancies in the system to sequence activities.
    • Determine what is needed during maintenance window by other users.
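
As an illustration of trending historical data, the following Python sketch estimates how long a task is likely to take from the durations recorded in earlier windows.  The estimation rule (mean plus a multiple of the sample standard deviation), the function name, and the sample numbers are assumptions chosen for illustration; a site might reasonably prefer a median or a percentile instead.

```python
from statistics import mean, stdev


def estimate_minutes(history, safety_factor=1.0):
    """Estimate a task's duration from past durations (in minutes).

    The estimate is the mean plus safety_factor times the sample
    standard deviation, so tasks with erratic histories get more slack.
    """
    if not history:
        raise ValueError("no historical data for this task")
    # stdev needs at least two data points; with one, assume no spread.
    spread = stdev(history) if len(history) > 1 else 0.0
    return mean(history) + safety_factor * spread


if __name__ == "__main__":
    # Hypothetical durations of the same task in three past windows.
    past_runs = [40, 50, 60]
    print(f"Plan for about {estimate_minutes(past_runs):.0f} minutes")
```

A larger safety_factor builds more slack into the master plan for that task, which matches the earlier point that the schedule needs room for surprises.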

High Availability Sites.  By the very nature of their business, high availability sites cannot afford to have large maintenance windows.  This also means they have to make the investments necessary to provide high availability.  These sites are also likely to have large numbers of redundant systems to help improve reliability, performance and maintenance.  Most of the principles outlined earlier on this page still apply, but we will review some similarities and differences, as Limoncelli does.

  • Similarities
    • Schedule maintenance windows so that they have smallest impact on customers.
    • Need to let customers know when maintenance windows will occur.
      • make sure to limit this to appropriate customers
    • Planning and preparation are critical in order to shorten maintenance window durations.
    • Still need a flight director.
    • Change proposals are at least as essential.
    • The window needs to be tightly planned.
    • The flight director needs to be very strict about timing.
    • Everything needs to be fully tested.
    • Console servers provide considerable benefits.
    • The sys admins need to have a strong presence after the site/services are restarted.  This should help them find unseen difficulties and respond to them quickly, as well as help create confidence in their work.
    • A brief postmortem to discuss remaining problems is almost surely going to be useful.
  • Differences
    • Redundancy is a necessity if high availability is required.
    • When working with redundancy it isn't necessary to disable services, but sequencing and scheduling must be planned appropriately.
    • Maintenance shouldn't require a full shutdown, but keeping dependency lists and working within the machine dependencies are almost always crucial.
    • For e-commerce sites, since customers aren't on-site it isn't very important to be physically visible when the system is fully operational again.  But clearly, being available and responsive is likely to have desirable effects.
    • A post-maintenance communication isn't likely to be necessary for many high availability sites.  You are likely to try to make sure that customers are as unaware as possible that maintenance is occurring.
    • The flight director needs to make sure that none of the scheduling and dependencies can take the service down.  They are also very likely to want to plan to enable particular minimum levels of service.
    • Actual availability of services will almost certainly need to be monitored during the maintenance.  Plans for how to deal with any such failures are also likely to be necessary.
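
The idea of sequencing work across redundant systems while preserving a minimum level of service can be sketched as follows.  This Python example splits a pool of redundant servers into maintenance batches so that a required number always stay in service.  The server names, the function name, and the minimum of three are hypothetical, and the sketch assumes all servers in the pool are interchangeable.

```python
def rolling_batches(servers, min_in_service):
    """Split servers into maintenance batches that keep enough capacity up.

    At most len(servers) - min_in_service servers are taken out of
    service at a time, so min_in_service servers always keep running.
    """
    batch_size = len(servers) - min_in_service
    if batch_size < 1:
        raise ValueError("not enough redundancy to take any server down")
    return [
        servers[i:i + batch_size]
        for i in range(0, len(servers), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical pool of five interchangeable web servers.
    pool = ["web1", "web2", "web3", "web4", "web5"]
    for number, batch in enumerate(rolling_batches(pool, 3), start=1):
        print(f"Batch {number}: take down {', '.join(batch)}")
```

In practice each batch would be followed by testing and a check of actual service availability before the next batch is started, in line with the monitoring point above.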