OIT process for alerts and outage communications

Rutgers IT / IT Communications / OIT process for alerts and outage communications

OIT process for alerts and outage communications

The Office of Information Technology has established the following process for communicating about significant IT outages and other unplanned incidents or IT emergencies. The process is guided by these core principles:

Communicate ASAP, with frequent updates
Focus on end user impact
Do not wait for explanations of root causes

Messages are posted to the Alerts webpage at the Rutgers IT website. These messages and updates to them are automatically shared at the #alerts channel on Rutgers IT Slack. The messages are also shared at the @RutgersIT Facebook and Twitter channels.

Overview of process
Step 1: Help Desk learns of issue and posts to private Slack channel for emergency comm. Step 2: Team on Slack channel determines message to post. Step 3: Message is posted, with updates provided through incident resolution.
This process is designed to inform the non-IT community about major IT outages and other IT emergencies. With that in mind, we avoid explaining or debating technical details, using language unfamiliar to a general audience, or seeking to determine root causes. The emphasis is entirely on communicating to Rutgers students, faculty, and staff the impact of the problem—that a service is unavailable, degraded, or otherwise experiencing issues—and providing user-friendly, non-technical updates along the way to the issue’s resolution.

As soon as we know there is an issue, we want to communicate that—even in the absence of details about what is wrong and what’s causing it.

Additional guidelines for this process
In the interest of communicating quickly and effectively, we follow these guidelines:

We do not wait for confirmation from service owners before posting to the Alerts webpage. Once we have indications of issues with a service (reports from users or from distributed IT staff, for instance), we communicate via the Alerts webpage ASAP, without waiting for confirmation from a service owner. The Help Desk contacts the service owner to indicate a message is being posted. Because it is not possible to know an incident’s significance at the outset (or if it will last minutes or hours), we treat all incidents seriously and post ASAP.
Our initial messages may lack detail, and that’s OK. The initial message posted will often lack details and focus entirely on the user impact. For example: Degree Navigator is currently unavailable. OIT is investigating. Or, The Office of Information Technology Help Desk has received reports of issues with the Rutgers network. OIT is investigating. This alert does not indicate whether the issue is with the wireless network, with the wired network, or even if it’s about connecting to the network or accessing websites or resources from the network. It does serve an important goal: If someone is wondering about their network-related problem, they will be able to see that there is, in fact, some sort of issue, and they will know that we’re looking into it.
The Help Desk coordinates with the service owner. As soon as we are aware of an issue, the Help Desk alerts the service owner and seeks details and updates. Throughout the process, the Help Desk coordinates the flow of information about the incident. In some instances, the Help Desk may facilitate direct contact between the service owner and the IT communications team.
In general, we do not post about services aligned with or “owned” by other departments or units. We typically reserve our posts to OIT services (i.e., the network, NetID, email, etc.), with some exceptions.

Contacting OIT leadership

The Help Desk, service owner, and/or the IT communications team will determine if the incident may have an impact that merits immediate notification of OIT leadership (extensive and ongoing Wi-Fi outage, NetID issues with u-wide disruptions), as follows:

Any of these teams (Help Desk, service owner, IT communications) is empowered to make this determination and reach out to OIT leadership.
In practice, this will entail contacting the relevant AVP or VP (i.e., the one most relevant to the incident) and asking that individual to inform others in leadership. Alternatively, these teams may decide to contact all AVPs, VPs, and the CIO immediately by text.
It is understood that these teams often have to act on incomplete information. I.e., an incident appearing significant could easily turn out to be a minor incident.

Key roles and responsibilities

Help Desk

Monitors for issues
Initiates discussion on Slack channel for emergency communications
Coordinates and gathers information about issue, then shares on Slack channel
Posts alerts with the use of template messages and scripts

IT communications

Develops template messages and scripts for use by Help Desk
Collaborates with Help Desk to draft and post messages
Works with service owner and Help Desk to determine if additional communications may be needed beyond posting to the Alerts webpage

Service owner

Informs the Help Desk about known issues
Provides regular updates to the Help Desk