Wednesday, 25 June 2014

Why run browser based acceptance tests as monitoring checks?

So far, the services I have been involved with have had both acceptance tests and monitoring checks.  For various reasons, the tests and checks have been designed, developed and run in separate worlds.  This approach has led to issues falling through the cracks and remaining undetected in live for too long.

Here I will explain our first pass at joining these two separate suites... running browser based acceptance tests as monitoring checks.

What we've done in the past... Separate tests and checks


Acceptance tests


A good suite of acceptance tests should test your service end to end, verify external code quality and run against real external dependencies.  Some acceptance test suites can tick all of these boxes without relying on browser-based tests, which is great because it removes a lot of complexity.  For other test suites, a browser is essential.  In past projects, the browser-based acceptance tests have run against various non-production test/integration environments.  One project had them running in production; however, they were separate from the monitoring checks and not under the careful watch of the operations team.  The main reasons for this include:
  • Not having a virtual machine with full production set-up (i.e. highly available) that is capable of running a web browser.
  • Some test suites included tests that were deemed "impossible" to run in live.  This then banished the entire suite to non-live environments.


Monitoring checks


Good monitoring should be able to verify whether the service is working and, when it is not, reveal the underlying causes.  The monitoring checks we created in the past have used the Nagios framework and typically test the service's external dependencies at a low level (e.g. network connectivity, DNS) and a high level (e.g. sending an HTTP request to an API).  Where possible we also performed some simple acceptance tests, for example testing that the application returns an expected HTTP status code when requesting a given URL.  Verifying anything more than this can quickly become complicated, especially when user journeys span multiple pages and involve JavaScript.
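To make the "expected HTTP status code" style of check concrete, here is a minimal sketch of a Nagios-style plugin in Python.  It is an illustration rather than the checks we actually ran: the function names and the example URL are hypothetical.  The one fixed convention it relies on is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import sys
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_status(actual, expected=200):
    """Map an observed HTTP status code to a Nagios exit code and message."""
    if actual == expected:
        return OK, f"OK - got HTTP {actual}"
    return CRITICAL, f"CRITICAL - expected HTTP {expected}, got {actual}"


def check_url(url, expected=200, timeout=10):
    """Request the URL and classify the status code it returns."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return classify_status(response.status, expected)
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still usable.
        return classify_status(err.code, expected)
    except (urllib.error.URLError, OSError) as err:
        # DNS failures, refused connections and timeouts end up here.
        return CRITICAL, f"CRITICAL - request failed: {err}"


def main(argv):
    """Plugin entry point: print one status line and return the exit code."""
    # Hypothetical default URL; pass the real one as the first argument.
    target = argv[1] if len(argv) > 1 else "https://example.com/"
    code, message = check_url(target)
    print(message)
    return code

# As a plugin script, finish with: sys.exit(main(sys.argv))
```

Note that `urlopen` follows redirects, so this reports the final status code.  It also says nothing about whether the page's JavaScript works, which is exactly the limitation described above.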

The cracks between Acceptance Tests and Monitoring checks... 


With the monitoring checks and acceptance tests in place as described above, you're reasonably well covered.  However, take this as an example... what if a JavaScript bug in your user flow prevents users from completing their journey?  Your external dependencies are fine, so no monitoring checks will fire.  Your acceptance tests won't save you if they aren't running, or being monitored, in production.

We have got around these cracks in the past by displaying current user traffic on a dashboard.  The idea is that if user traffic suddenly drops to zero, we'd know something is wrong.  However, with some services it's very hard to define a threshold of user traffic which represents activity that is clearly abnormal.  If nobody has completed a given flow in an hour... is it broken or is it Christmas morning?  Set the thresholds too aggressively and you spam your Ops team during low traffic; set them too conservatively and it takes a long time to recognise an issue, which reduces the value of having the check at all.
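The threshold problem can be shown with a deliberately simplified sketch.  The hourly completion counts and the two threshold values below are made up for illustration and do not come from a real service.

```python
def should_alert(completions_last_hour, threshold):
    """Alert when fewer journeys completed than the threshold requires."""
    return completions_last_hour < threshold


# Hypothetical hourly completion counts, invented for this example.
scenarios = {
    "normal hour": 40,
    "slow night": 3,
    "Christmas morning": 0,
    "real outage": 0,
}

AGGRESSIVE = 10   # hypothetical: alert on anything under 10 per hour
CONSERVATIVE = 1  # hypothetical: alert only when nothing completes

for name, count in scenarios.items():
    print(
        f"{name:18} aggressive={should_alert(count, AGGRESSIVE)} "
        f"conservative={should_alert(count, CONSERVATIVE)}"
    )
```

The aggressive threshold fires on the slow night as well as the outage.  The conservative one stays quiet on the slow night, but still cannot tell Christmas morning from a real outage, because both present the same count of zero.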

The New Approach


Joining up the two suites and running your acceptance tests as monitoring checks gives you far greater confidence that your service is working at any point in time.  I will create a separate post with the implementation details of this approach soon.
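As a rough sketch of the general idea only (not the implementation referred to above), a thin wrapper can run an acceptance test command and translate its result into a Nagios status.  It assumes the test runner exits with 0 when every test passes and non-zero otherwise, which is conventional but worth confirming for your runner.  The function name and the timeout default are hypothetical.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_suite_as_check(command, timeout_seconds=300):
    """Run a test command and translate its result into a Nagios status."""
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout_seconds
        )
    except subprocess.TimeoutExpired:
        # A hung browser journey is treated as a failure of the service.
        return CRITICAL, f"CRITICAL - suite exceeded {timeout_seconds}s"
    except OSError as err:
        # The runner itself could not be started, so the result says
        # nothing about the service.
        return UNKNOWN, f"UNKNOWN - could not start suite: {err}"
    if result.returncode == 0:
        return OK, "OK - acceptance suite passed"
    return CRITICAL, f"CRITICAL - suite failed (exit code {result.returncode})"
```

A plugin script would call this with whatever command starts the browser-based suite, print the message and exit with the returned code, so the monitoring system can schedule and alert on it like any other check.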


Disclaimer... 


I'll be honest, it's still early days for us with this approach and while we do have our tests running against production, we are currently only soft-live.  We haven't got Ops monitoring this yet.  However, I am confident this approach is viable and will have many benefits.  Hopefully I can update this blog with positive news and the team's learnings as we progress to hard-live.