In a world of always-on services, uptime is critical to both your business and your customers’ businesses. No matter how much developer discipline, redundancy, and testing you invest in, your service is going to go down at some point. When it does, the DevOps team is in a race against the clock to resolve the issue, with the 9’s in uptime melting away each passing minute.
At Bandwidth, we are always looking for ways to improve our processes and uptime, and implementing ChatOps did just that. In this post I’ll show how ChatOps can help add a 9 to your service’s uptime. ChatOps, a term coined by GitHub, is an approach to managing technical and business operations through a group chat room. Chat applications like Slack, HipChat, and Skype are pervasive communication tools in software organizations. With a few add-ons, these chat applications can also be a great tool for increasing your service’s uptime.
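To make "adding a 9" concrete, the short sketch below works out how many minutes of downtime per year each availability level allows. It is an illustration only: the function name and the 365-day year are assumptions made for this example, not figures from Bandwidth.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(nines: int) -> float:
    """Yearly downtime allowed at an availability of `nines` nines.

    For example, 3 nines means 99.9% availability, so 0.1% of the
    year may be spent down.
    """
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return MINUTES_PER_YEAR * 10.0 ** -nines


if __name__ == "__main__":
    for nines in (3, 4, 5):
        budget = downtime_budget_minutes(nines)
        print(f"{nines} nines: {budget:.2f} minutes of downtime per year")
```

Each extra 9 cuts the yearly budget by a factor of ten, which is why minutes saved during an outage matter so much.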
Four Phases of an Outage
A service outage can be broken down into four phases: detection, response, diagnosis, and resolution. Let’s look at each phase and how ChatOps can help with it.
The detection phase is all about your monitoring. You want to detect a site issue as quickly as possible and notify the right people who can respond, diagnose, and resolve the issue.
ChatOps helps the detection phase by notifying the right people quickly. Create a chat room dedicated to monitoring alarms and add all of the right people to it. Integrate your monitoring tools with your chat application by sending alarms to the room. For example, at Bandwidth the Communications API team has all monitoring alarm emails sent to the catapult-alarm channel using the Slack E-Mail add-on. Individuals in the room can control how they are notified when new messages arrive.
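If your monitoring tool can run a script rather than send email, another way to get alarms into the room is a Slack incoming webhook, which accepts a JSON body with a "text" field. The sketch below is one possible shape for that, not the setup used at Bandwidth (which relies on the email add-on described above). The function names, the message format, and the webhook URL you pass in are all assumptions made for illustration.

```python
import json
import urllib.request


def build_alarm_payload(service: str, severity: str, summary: str) -> dict:
    """Format an alarm as the JSON body a Slack incoming webhook expects."""
    return {"text": f"[{severity.upper()}] {service}: {summary}"}


def post_alarm(webhook_url: str, payload: dict, timeout: float = 5.0) -> int:
    """POST the payload to the webhook and return the HTTP status code.

    `webhook_url` is the URL Slack generates when you create an incoming
    webhook for the alarm channel; it is not hard-coded here.
    """
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

A monitoring script would call build_alarm_payload with the alarm details and hand the result to post_alarm. Keeping the formatting separate from the network call makes the message format easy to test without touching Slack.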
The response phase is the time between when the outage is detected and when someone begins looking into the issue.
Using the monitoring alarm room created for the detection phase above, ChatOps can reduce the response time because the members of the room will be notified on their phones wherever they are. With the right ChatOps add-ons, team members can begin looking into the issue right in the chat room by typing commands for a chat bot. A chat bot is a custom add-on that responds to your commands. For example, on the Communications API team at Bandwidth we have a chat bot named catbot. Any member of the team, including non-developers, can type “catbot health” in the alarm room and get the state of over 100 alarms.
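The command handling behind a bot like this can be quite small. The sketch below shows one way a "health" command might be parsed and answered. Only the bot name and the command come from this post; the alarm names, the state values, and the reply wording are hypothetical, and a real bot would fetch alarm states from the monitoring system rather than take them as an argument.

```python
from typing import Dict, Optional

BOT_NAME = "catbot"  # name from the post; everything else is illustrative


def summarize_health(alarms: Dict[str, str]) -> str:
    """Summarize alarm states, listing every alarm that is not "OK"."""
    firing = sorted(name for name, state in alarms.items() if state != "OK")
    if not firing:
        return f"All {len(alarms)} alarms are OK."
    return f"{len(firing)} of {len(alarms)} alarms firing: " + ", ".join(firing)


def handle_message(text: str, alarms: Dict[str, str]) -> Optional[str]:
    """Return the bot's reply, or None if the message is not a bot command."""
    words = text.strip().lower().split()
    if len(words) < 2 or words[0] != BOT_NAME:
        return None  # ordinary chat, or the bot name with no command
    if words[1] == "health":
        return summarize_health(alarms)
    return f"Unknown command '{words[1]}'. Try '{BOT_NAME} health'."


if __name__ == "__main__":
    sample = {"api-latency": "OK", "db-connections": "ALARM"}
    print(handle_message("catbot health", sample))
```

Because the reply is plain text, anyone in the room can read the result without access to the monitoring dashboards.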