Three Simple Questions
Posted on Mar 4 2015
At ThoughtWorks at the moment, I'm working with some of our internal teams, and we're having a discussion around monitoring and alerting. A number of the systems we look after don't need 24/7 uptime, often only being required for short periods of time, such as end-of-month payroll systems. There are other systems we develop and maintain that are useful to our everyday business, but again aren't seen as "Can't go down!" applications. This means that for a lot of our systems we haven't invested in detailed monitoring or alerting capabilities.
However, we really want to raise our game. Just because a system doesn't need to be up 24/7 doesn't mean you shouldn't clearly communicate when it is down. We already do an OK job of alerting around planned downtime for many of our systems (although we are looking to improve), but we're more interested in getting better at handling unplanned outages.
Now, when the topic of monitoring and alerting comes up with delivery teams, especially when they are new to owning the operational aspects of the applications they designed and implemented, it can seem like a daunting prospect. The technology landscape in this area and the number of things we could track are both large. The conversation inevitably degenerates into discussions about which metrics we should monitor, where they should be stored, which tool is best suited to storing them, and so on.
But, as a starter, I always ask the same few questions of the team when we start looking at this sort of thing. I feel they get fairly quickly to the heart of the problem.
What Gets You Up at 3am?
It's a very simple question. What would have to go wrong with your system for you to be woken up at 3am?
This is what gets to the heart of the matter. Talk of monitoring CPU levels falls by the wayside at this point. Generally the answer is "if the system isn't working". But what does that mean?
For a website, a simple ping might be enough. Or you might need to use a technique like injecting synthetic transactions into your system to mimic user behaviour and spot when a problem occurs, something I detail in the book.
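To make that concrete, here's a minimal sketch of the ping-style check in Python. The /health URL is a hypothetical placeholder, not a real endpoint; a synthetic transaction would have the same shape but would drive a real user journey (log in, submit a form, check the result) rather than a single GET.

```python
# Minimal uptime check against a hypothetical /health endpoint.
# A failing check exits non-zero so cron or a monitoring box can raise an alert.
import sys
import urllib.error
import urllib.request

HEALTH_URL = "https://payroll.example.internal/health"  # hypothetical endpoint
TIMEOUT_SECONDS = 10

def check() -> bool:
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers HTTP errors, DNS failures, timeouts and refused connections.
        return False

if __name__ == "__main__":
    if not check():
        print(f"ALERT: {HEALTH_URL} failed its health check")
        sys.exit(1)
    print("OK")
```

Run from cron every few minutes, that non-zero exit is enough to hook into whatever alerting you already have.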
If the answer is "nothing", then this isn't a mission-critical system. It's probably enough that every morning someone checks a dashboard and says "there's a problem!". Although the delivery team are far from the only stakeholders, as we'll discuss in a moment.
But if you are the unfortunate soul who has indeed been woken up in the early hours of the morning, there is another important question we have to ask.
When Someone Gets Woken Up at 3am, What Do They Need To Know To Fix Things?
Telling them what broke is the obvious part. But what do they need to do to find out why it broke, and what do they need to do to fix the problem? This is where you need drill-down data. They may need to see error counts over time to work out when the problem started happening - a spike in error rates, for example. They may need to access the logs. Can they do that from home? If not, you may need to fix that.
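As an illustration of the "when did it start?" question, here's a rough sketch that counts errors per minute from a log file. It assumes a hypothetical log format with an ISO timestamp and an ERROR level; your actual logs (and your log aggregation tooling) will differ.

```python
# Rough sketch: count errors per minute to eyeball when a spike began.
# Assumes lines like "2015-03-04T03:12:45 ERROR ..." (hypothetical format).
from collections import Counter

def errors_per_minute(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path) as log:
        for line in log:
            if " ERROR " in line:
                minute = line[:16]  # e.g. "2015-03-04T03:12"
                counts[minute] += 1
    return counts

if __name__ == "__main__":
    counts = errors_per_minute("/var/log/payroll/app.log")  # hypothetical path
    for minute, count in sorted(counts.items()):
        print(f"{minute}  {count:4d}")
```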
This is where information which is one or two steps removed from an application failure is useful. Now is when knowing that one of your machines is pegged at 100% CPU is important, or seeing when the latency spike occurred. This is why we need tools that allow us to see information in aggregate, across a number of machines, but also dive into a single instance to see what is going on.
The poor, bleary-eyed techie who got the wake-up call needs this information at their fingertips, where they are now.
Who Else Needs To Know There Is A Problem?
You may not need to be woken up at 3am to fix an issue, but other people may need to know there is a problem. This is especially true at ThoughtWorks, which is a global business with offices in 12 countries around the world. For a non-critical system we may be completely happy for it to stay down until the delivery team comes back to work, but we do need to at least communicate with people, letting them know there is an issue and that it will be worked on.
Services like https://www.statuspage.io/ can help here. Automatically sending a 'service down' notification at least helps people understand it's a known issue, and it also prompts people to report a problem if the status page doesn't show one. Tools like http://www.pagerduty.com/ can also help, alerting out-of-hours support people who can acknowledge the ticket and indicate it's being looked at.
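To show the shape of the idea, here's a sketch that pushes a 'service down' notice to a notification endpoint. The URL and payload are hypothetical placeholders for an internal webhook, not the actual statuspage.io or PagerDuty APIs, so check their documentation for the real request formats.

```python
# Sketch of reporting an outage to a status/notification endpoint.
# The URL and payload below are hypothetical, not a real vendor API.
import json
import urllib.request

STATUS_WEBHOOK = "https://status.example.internal/api/incidents"  # hypothetical

def report_outage(service: str, message: str) -> None:
    payload = json.dumps({
        "service": service,
        "status": "major_outage",
        "message": message,
    }).encode("utf-8")
    request = urllib.request.Request(
        STATUS_WEBHOOK,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)

if __name__ == "__main__":
    report_outage("payroll", "Payroll is down; the team is investigating.")
```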
For a modern IT shop, even for non-mission-critical software, being relentlessly customer-centric means communicating clearly with our customers via whatever channel they prefer. Find out who needs to know, and what the best way is to get that information to them. They may be completely OK with you taking a day to fix an issue if you communicate with them clearly, but will be much less tolerant if they see nothing happening.
And The Rest
These three questions help drive out the early, important stuff around monitoring and alerting. They'll get you started. But there are a host of other things to consider. For example, by doing trend analysis you may be able to predict failures before they happen. You may be able to monitor key customer-related metrics before and after a release to help detect bugs in the software. Or you could use all that lovely data to determine when your system will need to grow, or shrink.
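As a taste of that trend analysis, here's an illustrative sketch that fits a straight line to daily disk-usage samples and estimates when the disk fills up. The numbers are made up, and a real system would lean on proper time-series tooling, but the idea is the same.

```python
# Illustrative only: fit a line to daily disk-usage samples and estimate
# roughly when capacity runs out. The sample numbers below are made up.
def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
            sum((x - mean_x) ** 2 for x in xs)
    return slope, mean_y - slope * mean_x

days = [0, 1, 2, 3, 4, 5, 6]                     # day number
usage_gb = [410, 418, 431, 440, 452, 466, 475]   # hypothetical samples
capacity_gb = 500

slope, intercept = fit_line(days, usage_gb)
day_full = (capacity_gb - intercept) / slope
print(f"Growing ~{slope:.1f} GB/day; disk full in about {day_full - days[-1]:.1f} days")
```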
You may also want to start thinking about game day exercises, where you simulate failures and test your remediation mechanisms. This is a really interesting field, and I hope we can experiment with it more ourselves this year.
All that can wait though. Start with the easy stuff. Ask yourself three simple questions.