As the table above shows, the major issues that are likely to need
attention include the role of the flight director, developing a
coordinated plan, preparatory work, communicating with the end users
and/or customers, organizing the operations of the window, and complete
system testing.
The following outline provides a bit more discussion and
fills in other issues associated with the above stages and steps.
- Scheduling
- periodic, as needed or both
- must be worked out in conjunction with the
affected parts of the organization
- avoid product release dates
- end of fiscal year dates
- end of quarter dates
- maybe other seasonal constraints such as
summer or holiday impacts
- Planning
- individuals performing the maintenance need to
have significant input into the planning
- advance planning allows
- getting quotes
- investigating alternatives
- submission of purchase orders
- arrival of needed hardware/software
- lead times for these sorts of things can be
quite significant
- Flight Director
- needs to develop announcements stating when the window
will start and end and what will happen
- needs to decide what work proposals will be
implemented
- needs to schedule work proposals
- decides much about staffing
- monitor the progress
- ensure effective testing occurs
- needs to be an experienced sys admin with
considerable respect from others
- needs to make judgment calls on things such as
levels of risk and timing
- needs a good overview of the site
- should avoid actual technical work
- Change Proposals
- need to be submitted by a deadline that is not so far
in advance that they lose relevance, but far enough in advance that
they can be implemented
- What changes are to be made?
- What machines will require work?
- What are the pre-maintenance window dependencies
and due dates?
- What needs to be up for the changes to happen?
- Who will be affected by the changes?
- Who will perform the work?
- How long will the changes take?
- How much additional help is required?
- What are the test procedures?
- What equipment is required?
- What is the backout procedure and how long will
it take?
- The Master Plan
- Once proposals are frozen the master plan can be
determined
- What should go forward, what shouldn't?
- Need to determine dependencies and durations
- what needs to precede what?
- Need charts and tables to specify the part of
each person in the plan
- Needs to be some slack in the schedule in order
to adjust to surprises and things that go wrong
- Disabling Access
- What sorts of access need to be disabled or
discouraged?
- Place notices in appropriate places.
- Might make announcements through phone or public
address systems.
- Change helpdesk messages appropriately.
- Might require disabling all remote access.
- Mechanics and Coordination
- Very likely to have to consider dependencies for
shutdown and boot sequence
- critical servers
- console servers
- etc.
- Console service
- need console service when reasonable
- coordinates access and authentication to multiple
machines
- limits interaction points
- Radios
- if site is spread out at all then need ways to
coordinate workers
- Deadlines for Change Completion
- Flight director needs to monitor progress of
individual tasks
- Flight director needs to determine backout times
and dependencies
- Comprehensive System Testing
- Sometimes choose to test many more systems than
were directly involved
- Need to reboot in appropriate order
- Need to validate service functionality
- Post Maintenance Communication
- Flight director needs to notify everyone,
including end users, of which services have been restored and/or improved
- Re-enable Remote Access
- Visible Presence
- Some sys admins are likely to need to work the
help desk immediately following the window
- Some sys admins are likely to need to make sure
they are highly visible and accessible for interaction and feedback
immediately following the window
- Postmortem
- Final assessment of success and what else needs
to be worked on after all of this
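The dependency questions in the outline above (what needs to precede what, and the shutdown and boot sequence) can be sketched as a small topological sort. This is only an illustration under stated assumptions: the service names and the dependency map are hypothetical, and a real site would build the map from its own change proposals. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map (an assumption for illustration):
# each service lists the services that must be up before it can start.
DEPENDS_ON = {
    "console-server": [],
    "dns": ["console-server"],
    "auth": ["dns"],
    "fileserver": ["dns", "auth"],
    "mail": ["dns", "auth", "fileserver"],
}


def boot_order(depends_on):
    """Return services ordered so every dependency comes before its dependents.

    Raises graphlib.CycleError if the dependencies contain a cycle.
    """
    return list(TopologicalSorter(depends_on).static_order())


def shutdown_order(depends_on):
    """Shut down in the reverse of the boot order: dependents go down first."""
    return list(reversed(boot_order(depends_on)))


if __name__ == "__main__":
    print("boot:    ", boot_order(DEPENDS_ON))
    print("shutdown:", shutdown_order(DEPENDS_ON))
```

A cycle in the map raises an error instead of producing an order, which is a useful early warning that two change proposals depend on each other.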
The Icing. Now
we'll address some other issues associated with maintenance windows that
may be considered to be extra.
- Mentoring New Flight Directors
- Need to select them far enough in advance to
prepare them.
- A trainee can do a reasonable amount of the work
alongside the mentor.
- Trending Historical Data
- Analyze how long particular tasks take to
improve estimates in the future.
- Pass along knowledge through some objective
measures.
- Providing Limited Availability
- Might involve taking advantage of redundancies
in the system to sequence activities.
- Determine what other users need during the
maintenance window.
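As a rough illustration of trending historical data, the sketch below estimates how long a task will take from how long it took in past windows. Everything here is an assumption made for the example: the task names, the recorded minutes, and the policy of padding the median by a slack fraction are hypothetical, not a method prescribed by the source.

```python
from statistics import median

# Hypothetical history (an assumption for illustration):
# task name -> minutes the task actually took in past maintenance windows.
HISTORY = {
    "patch-fileserver": [40, 55, 45, 60],
    "replace-switch": [90, 120, 100],
}


def estimate_minutes(samples, slack=0.25):
    """Estimate a task's duration as the median of past runs plus slack.

    The slack fraction is an assumed policy; it leaves room for surprises.
    """
    if not samples:
        raise ValueError("no history for this task")
    return median(samples) * (1 + slack)


if __name__ == "__main__":
    for task, samples in HISTORY.items():
        print(f"{task}: plan for about {estimate_minutes(samples):.0f} minutes")
```

The median is used instead of the mean so that one unusually bad window does not dominate the estimate.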
High Availability Sites.
By the very nature of their business, high availability sites cannot
afford to have large maintenance windows. It also means they have to
make the investments necessary to provide high availability. These
sites are also likely to have large numbers of redundant systems to help
improve reliability, performance and maintenance. Most of the
principles outlined earlier on this page still apply, but we will review
some similarities and differences, as Limoncelli does.
- Similarities
- Schedule maintenance windows so that they have
the smallest impact on customers.
- Need to let customers know when maintenance
windows will occur.
- make sure to limit this to appropriate
customers
- Planning and preparation are critical in order to
shorten maintenance window durations.
- Still need a flight director.
- Change proposals are at least as essential.
- The window needs to be tightly planned.
- The flight director needs to be very strict about
timing.
- Everything needs to be fully tested.
- Console servers provide considerable benefits.
- The sys admins need to have a strong presence
after the site/services are restarted. This should help them
find unseen difficulties and respond to them quickly, as well as
help create confidence in their work.
- A brief postmortem to discuss remaining problems
is almost surely going to be useful.
- Differences
- Redundancy is a necessity if high availability
is required.
- When working with redundancy it isn't necessary
to disable services, but sequencing and scheduling must be planned
appropriately.
- Shouldn't require full shutdown, but dependency
lists and working within the machine dependencies are almost always
crucial.
- For e-commerce sites, since customers aren't
on-site it isn't very important to be physically visible when the
system is fully operational again. But clearly, being
available and responsive is likely to have desirable effects.
- A post-maintenance communication isn't likely
to be necessary for many high availability sites. You are
likely to try to make sure that customers are as unaware as
possible that maintenance is occurring.
- The flight director needs to make sure that
none of the scheduling and dependencies can take the service down.
They are also very likely to want to plan to enable particular
minimum levels of service.
- Actual availability of services will almost
certainly need to be monitored during the maintenance. Plans
for how to deal with any failures are also likely to be
necessary.
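The idea of sequencing work across redundant systems while keeping a minimum level of service can be sketched as follows. This is a simplified illustration, not a definitive procedure: the server names and the `min_in_service` threshold are hypothetical, and it assumes all servers in the pool are interchangeable.

```python
def rolling_batches(servers, min_in_service):
    """Split servers into maintenance batches, taken down one batch at a time,
    so that at least min_in_service servers stay up during every batch."""
    batch_size = len(servers) - min_in_service
    if batch_size < 1:
        raise ValueError("not enough redundancy to take any server down")
    return [servers[i:i + batch_size]
            for i in range(0, len(servers), batch_size)]


if __name__ == "__main__":
    # Hypothetical pool of five redundant web servers; keep three serving.
    pool = ["web1", "web2", "web3", "web4", "web5"]
    for step, batch in enumerate(rolling_batches(pool, min_in_service=3), 1):
        print(f"step {step}: take down {batch}")
```

Each batch would be returned to service and tested before the next one is taken down, so the pool never drops below the planned minimum.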