Telstra, Human Error and Blame Culture
Posted on Feb 10 2016
Australia's biggest telco, Telstra, suffered a pretty serious outage yesterday. For several hours, millions of customers experienced serious disruption to both voice and data services. In the face of understandable customer anger, Telstra's Chief Operating Officer apologised for the 'embarrassing' outage, and explained what had happened.
"We took that node down, unfortunately the individual that was managing that issue did not follow the correct procedure, and he reconnected the customers to the malfunctioning node, rather than transferring them to the nine other redundant nodes that he should have transferred people to," Ms McKenzie told reporters on Tuesday afternoon.
"We apologise right across our customer base. This is an embarrassing human error."
I took to Twitter, to express my incredulity:
This in turn led to a brief chat with Charis Chang, a journalist for news.com.au, and a short time later quite a few of my comments ended up on a piece on their website. It's not every day a tweet from me gets picked up by a national news website, so I thought it was worth following up with a blog post to try and explain why I reacted in the way I did.
First, we should deal with the reaction that many of you probably had when reading this: 'One person caused all this?' Yes, the mind boggles. Telstra seem to have a system which is not prepared for people to make mistakes. This is either a triumph of hope over reality, or just plain daft.
The implication is that a human being is able to make a change, with no checks and balances, that causes a huge, nationwide outage. For the nation's largest telco, which is relied upon by many businesses across the country, 'embarrassing' doesn't quite do it justice. However, it isn't the only problem that is evident here.
We Don't Know Why This Happened, But We Know Who To Blame
Now let's see some platinum grade buck-passing in action:
"[The employee responsible] didn't follow procedures and clearly that's not a good thing but I wouldn't want to pre-empt the proper investigation and we'll figure out what the right response is when we've had a chance to dig into the detail." - Australian Financial Review
Lovely. So one person made a mistake which caused the outage, which the COO feels happy stating publicly, while at the same time claiming there hasn't been a proper investigation. Clearly, blame first and ask questions later is the modus operandi at Telstra.
For the nation's biggest telco to have a serious outage that can be caused by an individual is crazy. To rush out a bit of misguided PR which confirms this is even more nuts.
Hands up who thinks it is healthy for senior leadership to publicly throw an employee under a bus. Sure, the manager in question wasn't named, but are you really telling me that people inside Telstra don't know who this person is? I wouldn't want to be them this morning.
Environments where senior management feel OK about singling out individuals are toxic. Blame cultures like this create a poisonous work environment. They lead to situations where mistakes cannot be admitted to, which in turn means that systemic issues go ignored. This hurts morale and hinders the creation of open, collaborative workforces which can innovate and make companies more successful.
If I was working at Telstra, I'd be looking for another job right now.
It's The System
From what we have been told so far, an individual was placed in a situation where a simple mistake was allowed to have huge ramifications. Aside from the concern about how ready Telstra are to assign blame, it speaks to a complete lack of understanding in Telstra's leadership as to how failures occur.
Singling out an individual to blame is perhaps common, but it is rarely if ever correct or healthy to do so. Most people who spend any time looking at the causes of failures understand that it is not individuals that cause failures; it is the systems that they are part of.
As John Allspaw puts it, looking for a single source of failure is like looking for a single source of success. Success or failure is down to a number of factors, and can rarely if ever be put down to 'this person screwed up'. A system in which a single individual can make a simple mistake which leads to a serious outage like this is in and of itself seriously broken.
Homework For Telstra's Leadership Team
For examples of how to better look at failures in systems, and how to handle problems when they occur, I have some reading and viewing to recommend for Telstra's senior management. In no particular order:
- Sidney Dekker - System Failure, Human Error: Who’s to Blame?
- John Allspaw - Blameless Postmortems and A Just Culture
- John Allspaw - Each necessary, but only jointly sufficient