Sane Change Management

This is a generalized change management document for making changes to a running 24x7 website. It can likely be extended and adapted for any safety-sensitive environment where peer review or checklists help build confidence that a change will have an unsurprising result.

The Short of It

Different categories of change get different amounts of sanity checking and peer review.

Examples of systems-level changes that might bring significant risk to the stability and resilience of the system:

  • Database Schema changes
  • Core Network changes
  • OS changes
  • PHP/Apache/MySQL/Postgres/etc. version changes
  • Other things I can't think of right now

For these changes, we'll have answers to the following questions, posted in a well-known place. The Ops on-call person will need to know what changes are happening and when.

  1. What is the change? (link to a task ticket or bug with all details, and an acknowledgement by a second engineer)
  2. What's it for, exactly? What problem does it solve?
  3. Has this change happened before? (Is there a successful history?)
  4. Who will be making the change?
  5. When is the change expected to start?
  6. When is the change expected to end?
  7. What is the rollback plan, if something goes wrong during the change?
  8. What is the test to make sure that the change succeeded?
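
As a rough illustration, the eight answers could be captured in a small record that reports which questions are still unanswered before the change is posted. This is only a minimal Python sketch: the class name ChangeRequest, its field names, and the helper missing_answers are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass
class ChangeRequest:
    """One field per checklist answer (all field names are hypothetical)."""
    ticket_link: str = ""      # 1. What is the change? (ticket or bug link)
    acknowledged_by: str = ""  # 1. Second engineer who acknowledged it
    purpose: str = ""          # 2. What problem does it solve?
    history: str = ""          # 3. Has this change happened before?
    performed_by: str = ""     # 4. Who will be making the change?
    start_time: str = ""       # 5. When is it expected to start?
    end_time: str = ""         # 6. When is it expected to end?
    rollback_plan: str = ""    # 7. What is the rollback plan?
    success_test: str = ""     # 8. How do we test that it succeeded?


def missing_answers(change: ChangeRequest) -> List[str]:
    """Return the names of the fields that are still blank."""
    return [f.name for f in fields(change)
            if not getattr(change, f.name).strip()]
```

A change would be ready to post once missing_answers returns an empty list; until then, the returned names show the author which questions still need an answer.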

The Long of It

When answering the questions above, here are some things to consider.

Who?
  • Who could be affected if the change fails?
  • Who is performing the change?
What?
  • What is affected by the change? (devices/hosts/network/etc)
  • How long is it expected to take?
  • What's the priority of the change?
    - Urgent: This has to be done right now, and could affect the entire site.
    - High: This could affect a large number of users, and needs high priority (today)
    - Medium: This doesn't have to be done right now, and can be postponed (this week)
    - Low: This has to be done eventually, but can be postponed indefinitely.
  • What will happen if the change is *NOT* made?
When?
  • When will this change be made?
  • When will it be finished?
  • When is "all-clear"?
How?
  • How will it be done? (phased deployment?)
  • How will we verify it's successful?
  • How successful were similar changes in the past?
What If?
  • What is the "rollback" plan if the change should fail for some reason?
  • What is the worst possible outcome associated with this change?
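
The four priority levels listed under "What?" could be encoded so that each change carries a latest acceptable start time. The sketch below is an assumption-laden example in Python: the names Priority, MAX_DELAY, and latest_start are hypothetical, and reading "today" as 24 hours and "week" as 7 days is an interpretation, not something this document specifies.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class Priority(Enum):
    URGENT = "urgent"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


# Assumed windows: "today" is read as 24 hours and "week" as 7 days.
MAX_DELAY = {
    Priority.URGENT: timedelta(0),      # has to be done right now
    Priority.HIGH: timedelta(days=1),   # today
    Priority.MEDIUM: timedelta(days=7), # within the week
    Priority.LOW: None,                 # can be postponed indefinitely
}


def latest_start(priority: Priority,
                 requested_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable start time, or None if there is none."""
    delay = MAX_DELAY[priority]
    return None if delay is None else requested_at + delay
```

Keeping the windows in one table makes it easy to adjust them to whatever "today" and "week" mean for a given team.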

The last section, "What If?", is essentially a segue to a contingency planning or "Go/No-Go" document.
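
The rollback plan and the success test from the checklist can be tied together in a single apply, verify, and roll back flow. The following is a minimal Python sketch under stated assumptions: run_change and its three callables are hypothetical stand-ins for whatever scripts or manual steps a real change would use, and it assumes the rollback itself is safe to run after a partial failure.

```python
from typing import Callable


def run_change(apply: Callable[[], None],
               verify: Callable[[], bool],
               rollback: Callable[[], None]) -> bool:
    """Apply a change, run its success test, and roll back on failure.

    Returns True only if the change was applied and the test passed.
    """
    try:
        apply()
        if verify():
            return True
    except Exception:
        # An error while applying or verifying is treated the same
        # as a failed success test: fall through to the rollback.
        pass
    rollback()
    return False
```

For example, run_change(deploy, smoke_test, restore_previous) would call restore_previous whenever deploy raises or smoke_test returns False, where those three names are placeholders supplied by the person making the change.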