Add a 9 to uptime with ChatOps
In a world of always-on services, uptime is critical to both your business and your customers’ businesses. No matter how much developer discipline, redundancy, and testing you put in, your service is going to go down at some point. When it does, the DevOps team is in a race against the clock to resolve the issue, and the 9’s in your uptime melt away with each passing minute.
At Bandwidth, we are always looking for ways to improve our processes and uptime, and implementing ChatOps did just that. In this post I’ll show how ChatOps can help add a 9 to your service’s uptime. ChatOps, a term coined by GitHub, is an approach to managing technical and business operations through a group chat room. Chat applications like Slack, HipChat, and Skype are pervasive communication tools in software organizations. With a few add-ons, these chat applications can also be a great tool for increasing your service’s uptime.
Four Phases of an Outage
A service outage can be broken down into four phases: detection, response, diagnosis, and resolution. Let’s look at each phase and how ChatOps can help with it.
The detection phase is all about your monitoring. You want to detect a site issue as quickly as possible and notify the right people who can respond, diagnose, and resolve the issue.
ChatOps helps the detection phase by notifying the right people quickly. Create a chat room dedicated to monitoring alarms and add all of the right people to that room. Integrate your monitoring tools with your chat application by sending alarms to the room. For example, at Bandwidth the Communications API team has all monitoring alarm emails sent to the catapult-alarm channel using the Slack E-Mail add-on. Each member of the room can control how they are notified when new messages arrive.
The response phase is the time between when the outage is detected and when someone begins looking into the issue.
Using the monitoring alarm room created for the detection phase above, ChatOps can reduce the response time because the members of the room are notified on their phones wherever they are. With the right ChatOps add-ons, team members can begin looking into the issue right in the chat room by typing commands for a chat bot. A chat bot is a custom add-on that responds to your commands. For example, the Communications API team at Bandwidth has a chat bot named catbot. Any member of the team, including non-developers, can type “catbot health” into the alarm room and get the state of over 100 alarms.
Chat bots greatly reduce response time since team members can respond using the chat app on their phone wherever they might be when the alarm is triggered. No more rushing home from the restaurant to a computer to respond to an alarm.
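To make the idea of a command like “catbot health” concrete, here is a minimal sketch of how a bot might route a chat message to a handler. It is not Bandwidth's catbot: the alarm names, the ALARM_STATES table, and the handle_message function are hypothetical, and a real bot would receive messages from the chat platform and pull alarm state from its monitoring tools.

```python
from typing import Callable, Dict, Optional

BOT_NAME = "catbot"

# Hypothetical alarm states; a real bot would fetch these from monitoring tools.
ALARM_STATES: Dict[str, str] = {
    "api-latency": "OK",
    "api-5xx-rate": "ALARM",
    "queue-depth": "OK",
}


def health() -> str:
    """Summarize alarm states and list every alarm that is not OK."""
    firing = sorted(name for name, state in ALARM_STATES.items() if state != "OK")
    ok_count = len(ALARM_STATES) - len(firing)
    summary = f"{ok_count}/{len(ALARM_STATES)} alarms OK"
    if firing:
        summary += "; firing: " + ", ".join(firing)
    return summary


# Map each command word to the function that answers it.
COMMANDS: Dict[str, Callable[[], str]] = {"health": health}


def handle_message(text: str) -> Optional[str]:
    """Return the bot's reply, or None if the message is not addressed to it."""
    words = text.strip().split()
    if not words or words[0].lower() != BOT_NAME:
        return None  # Not addressed to the bot; stay silent.
    command = words[1].lower() if len(words) > 1 else ""
    handler = COMMANDS.get(command)
    if handler is None:
        return "unknown command; try: " + ", ".join(sorted(COMMANDS))
    return handler()


if __name__ == "__main__":
    print(handle_message("catbot health"))
```

Keeping the commands in a plain dictionary means adding a new command is a one-line change, and an unrecognized command produces a list of valid ones instead of silence.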
The diagnosis phase is the time it takes for someone to look into the site issue and determine what is wrong and what steps are needed to resolve the issue. This does not include the time to actually fix the issue. This is typically the longest phase of a site outage.
ChatOps can reduce the diagnosis phase with the help of a custom chat bot that automates diagnosis. Your custom chat bot can query multiple systems simultaneously and analyze their metrics to determine what is wrong. For example, you can query AWS CloudWatch, run a SumoLogic query over your logs, and kick off an external monitoring test on your REST API, all from a single chat bot command issued from your phone.
The resolution phase is the time it takes to apply the steps determined in the diagnosis phase to bring the site back up. This is the riskiest phase of an outage because you are modifying the live site. A mistake during the resolution phase can make the outage worse. Automating common resolution steps reduces the mistakes that manual steps invite. By adding chat bot commands for common resolution steps, such as restarting services, rolling back a deploy, or whatever makes sense for your service, you can speed up the resolution phase and reduce the risk of error.
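One way to run several diagnostic checks at once is a thread pool. The sketch below assumes each check is a plain function that returns a one-line finding. The three check functions are hypothetical placeholders that return canned text; they do not call CloudWatch, SumoLogic, or any external service, and you would replace their bodies with real queries.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def check_metrics() -> str:
    # Placeholder for a metrics query (for example, against CloudWatch).
    return "CPU and latency within normal range"


def check_logs() -> str:
    # Placeholder for a log search (for example, a SumoLogic query).
    return "spike in 5xx responses in the last 10 minutes"


def check_external_probe() -> str:
    # Placeholder for an external test of the REST API; simulates a failure.
    raise TimeoutError("probe timed out after 5s")


CHECKS: Dict[str, Callable[[], str]] = {
    "metrics": check_metrics,
    "logs": check_logs,
    "probe": check_external_probe,
}


def diagnose(checks: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run every check concurrently and collect one finding per check."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(checks))) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result(timeout=30)
            except Exception as exc:
                # A failing check is itself a finding, so report it
                # instead of aborting the whole diagnosis.
                results[name] = f"check failed: {exc}"
    return results


if __name__ == "__main__":
    for name, finding in diagnose(CHECKS).items():
        print(f"{name}: {finding}")
```

Because the checks run in parallel, the total wait is roughly that of the slowest check, not the sum of all of them, which matters when the diagnosis phase is being counted in minutes.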
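As a rough illustration of automating resolution steps, the sketch below limits the bot to a fixed set of named actions and requires an explicit confirmation before anything runs. The confirmation step and the audit log are assumptions added for this example, not something described above, and the restart_service and rollback_deploy handlers are hypothetical placeholders that only return text.

```python
from typing import Callable, Dict, List

# Record of every action actually applied, useful later for the RCA timeline.
AUDIT_LOG: List[str] = []


def restart_service(target: str) -> str:
    # Placeholder: a real handler would call your deployment tooling.
    return f"restarted {target}"


def rollback_deploy(target: str) -> str:
    # Placeholder: a real handler would redeploy the previous release.
    return f"rolled back {target} to previous release"


# Only actions listed here can be triggered from chat.
ACTIONS: Dict[str, Callable[[str], str]] = {
    "restart": restart_service,
    "rollback": rollback_deploy,
}


def resolve(user: str, action: str, target: str, confirmed: bool = False) -> str:
    """Run an allowed action on a target, but only once it is confirmed."""
    handler = ACTIONS.get(action)
    if handler is None:
        allowed = ", ".join(sorted(ACTIONS))
        return f"unknown action '{action}'; allowed: {allowed}"
    if not confirmed:
        return f"dry run: would {action} {target}; repeat with confirm to apply"
    result = handler(target)
    AUDIT_LOG.append(f"{user} ran {action} on {target}: {result}")
    return result


if __name__ == "__main__":
    print(resolve("alice", "restart", "api"))
    print(resolve("alice", "restart", "api", confirmed=True))
```

The dry-run default trades a few seconds for a chance to catch a mistyped target before it touches the live site.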
ChatOps has additional benefits. First, those responsible for responding to outages are no longer tied to their computers. When an alarm triggered one Friday evening after hours, I was able to determine from my phone, without leaving my seat at a restaurant, that it was a temporary blip. Before ChatOps, I would have had to rush home to a computer. Second, with ChatOps and a monitoring alarm room, everyone can see what is going on with the outage. People who join later can quickly catch up by reading the thread in the chat room. Third, a group chat room for alarms greatly simplifies creating the timeline for root cause analysis (RCA). It becomes a matter of copying and pasting out of the chat room: you have the exact times when alarms fired, who responded, and what was done to diagnose and resolve the issue. Previously we had to piece together email threads and timelines, and recall conversations and the commands executed to resolve the outage. With ChatOps, we have one-stop shopping for RCA.
To reach 3 9’s of uptime you have a little over 10 minutes for each phase; to reach 4 9’s, a little over 1 minute; and to reach 5 9’s, a little over 6 seconds. These figures assume a single outage in a 30-day month, with the allowed downtime split evenly across the four phases. In practice, however, you do not want to spend equal time in each phase. Instead, you want to minimize the detection and response phases so you have the most time to spend in the critical diagnosis and resolution phases. ChatOps, through the use of add-ons and chat bots, shortens each phase by allowing team members to receive notifications, respond, diagnose, and resolve issues from their phones. Start thinking today about how ChatOps can add a 9 to your uptime and improve your DevOps team. In the coming months, look for additional posts on writing your own chat bot for ChatOps.
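The per-phase time limits quoted above can be reproduced with a few lines of arithmetic. This sketch assumes a 30-day month, a single outage in that month, and an even split across the four phases; the function names are illustrative.

```python
def downtime_budget_seconds(nines: int, period_days: int = 30) -> float:
    """Allowed downtime in seconds for an uptime target with this many 9's."""
    unavailability = 10 ** -nines  # 3 nines -> 0.001, i.e. 99.9% uptime
    return period_days * 24 * 60 * 60 * unavailability


def per_phase_seconds(nines: int, phases: int = 4, period_days: int = 30) -> float:
    """Split the downtime budget evenly across the outage phases."""
    return downtime_budget_seconds(nines, period_days) / phases


if __name__ == "__main__":
    for nines in (3, 4, 5):
        seconds = per_phase_seconds(nines)
        print(f"{nines} nines: {seconds:.2f} s per phase ({seconds / 60:.2f} min)")
```

With these assumptions, 3 9’s allows about 43.2 minutes of downtime a month, or 10.8 minutes per phase; 4 9’s allows about 64.8 seconds per phase, and 5 9’s about 6.5 seconds.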