Alerts should treat your Ops team like the police... only contact them in real emergencies!

Every time an alert notifies your Ops team, there should be a real problem.  If there isn't a real problem, you're wasting their time, adding confusion and making it harder for them to respond to real incidents.

The problem we have: Too many alerts for non-issues and non-critical issues.

I sat with a member of our Ops team recently and was horrified to see how many notifications they received from our monitoring systems.  At times, it was almost impossible to make any sense of them due to the sheer volume of emails filling their inbox.

Why is this such a problem?

Multiple reasons:

  • Each notification comes at the cost of lessening the impact of all other notifications.  Taken to the extreme, where Ops receive hundreds per day, the impact of any single alert can be almost zero.
  • A false alarm is a distraction and can waste valuable minutes in debugging real issues.


How did we get here?

The short answer is: by diligently adding more alerts but not diligently reviewing their behaviour.  We (myself included) are guilty of the crime... Wasting Ops Time!  


Addressing the problem

This is not an easy problem to solve and will take time and effort to fix.  

Our Ops team is now (rightfully) putting pressure on the development teams to clear up the worst offending alerts.  They routinely send emails of the top ten "flapping" alerts (i.e. constantly cycling between problem and recovery) in an effort to reduce the spam.  Exposing the problem is definitely the first step towards solving it.
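
As a rough illustration of how a "top ten flapping" list could be produced, here is a minimal Python sketch.  It assumes you can export alert events as chronological (alert name, state) pairs.  The alert names, the PROBLEM/RECOVERY state labels and the function names are all hypothetical and not taken from any particular monitoring tool.

```python
from collections import Counter

# Hypothetical event log: (alert name, state) pairs in chronological order.
events = [
    ("disk-usage", "PROBLEM"),
    ("disk-usage", "RECOVERY"),
    ("api-latency", "PROBLEM"),
    ("disk-usage", "PROBLEM"),
    ("disk-usage", "RECOVERY"),
    ("api-latency", "RECOVERY"),
    ("disk-usage", "PROBLEM"),
]

def count_flaps(events):
    """Count state changes per alert; a high count suggests flapping."""
    last_state = {}
    flaps = Counter()
    for name, state in events:
        # A flap is any change from the previously seen state of this alert.
        if name in last_state and last_state[name] != state:
            flaps[name] += 1
        last_state[name] = state
    return flaps

def top_flapping(events, n=10):
    """Return the n alerts with the most state changes, worst first."""
    return count_flaps(events).most_common(n)

print(top_flapping(events))
```

Counting state changes rather than raw notifications keeps the focus on alerts that cycle between problem and recovery, which is what the Ops emails are highlighting.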

Review a week's worth of your notifications... for each one ask yourself:

  • Does this alert reflect a real issue with our service that needs to be fixed?
  • Was this an issue that Ops really needed to know about?
  • What is the worst-case scenario for end users given the alert?
  • Could the alert be safely set to only warn? (e.g. never notify, and only display yellow/amber rather than red on your dashboard).

Some alerts will be a lot easier to fix than others, but you can easily prioritise by the number of notifications sent in a given period.  
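
Prioritising by notification count can be sketched in a few lines of Python.  This is only an illustration under assumptions: it presumes you can export notifications as (sent time, alert name) pairs, and the sample data, the seven-day default and the rank_alerts name are all made up for the example.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical export of notifications: (time sent, alert name).
notifications = [
    (datetime(2024, 1, 7, 9, 0), "queue-depth"),
    (datetime(2024, 1, 6, 14, 30), "queue-depth"),
    (datetime(2024, 1, 5, 8, 15), "cert-expiry"),
    (datetime(2024, 1, 3, 22, 45), "queue-depth"),
    (datetime(2023, 12, 20, 11, 0), "cert-expiry"),  # outside the review window
]

def rank_alerts(notifications, now, days=7):
    """Rank alerts by how many notifications they sent in the last `days` days."""
    cutoff = now - timedelta(days=days)
    counts = Counter(
        name for sent_at, name in notifications if cutoff <= sent_at <= now
    )
    # most_common() sorts from the noisiest alert to the quietest.
    return counts.most_common()

for name, count in rank_alerts(notifications, now=datetime(2024, 1, 8)):
    print(f"{count:3d}  {name}")
```

The alerts at the top of the list are the ones to review first, since fixing them removes the most noise.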

Final point: Get included on your system's alerts (if you aren't already) and remove your email filters so they come straight to your inbox and you feel Ops' pain.  Filters offer an all-too-convenient way of burying your head in the sand.
