Monday, 30 March 2015

Nagios vs Sensu vs Icinga2

Choosing a suitable monitoring framework for your system is important.  If you get it wrong, you might find yourself having to rewrite your checks and set up something different, most likely at great cost.  Recently I looked into a few monitoring frameworks for a system and came to a few conclusions, which I'll share below.

Background

The System
At the time of this investigation, the system (which has a microservices architecture) was in the process of being "productionised".  It had no monitoring in place and had never been supported in production.  The plan was to introduce monitoring so that it could be supported and monitored 24x7, in the hope of achieving minimal downtime.

Warning - I am biased
Before we get started, I have to acknowledge a few biases.  I have worked with Nagios in the past and found it to be a bit of a pain.  However, that was probably because we created our checks in Puppet, which added an extra layer of complexity to an already steep learning curve.  I decided to re-evaluate Nagios because (a) we'd be creating our monitoring checks directly and (b) Nagios has moved on since.
I'm probably also biased towards newer technologies over older ones, for no better reason than that I'm currently at a startup working with a lot of new technologies.
  

Requirements

The requirements were as follows:
  1. Highly scalable in terms of:
    1. handling complexity (presenting a large number of checks in a way that's easy to understand).
    2. handling load (can support lots of hosts with lots of checks).
  2. Secure.
  3. Good UI with:
    1. Access to historical alerts.
    2. Able to switch off alerts temporarily - possibly with comments, e.g. “Ignoring unused failing web node - not causing an issue”.
  4. Easy to extend/change.
    1. Ability to define custom checks.
    2. Ability to add descriptive text to an alert e.g. “If this check fails it means users of our site won't be able to...”
    3. Easy to adjust alarm thresholds.   
  5. Good support for check dependencies.
    1. This is related to requirement 1.1 - ideally the monitoring system will help the user separate cause from effect.  When you have hundreds of alerts firing it becomes hard to establish the underlying cause (see my earlier post on this).  Without alert dependencies, the more alerts you add, the more you increase the risk of confusion during an incident.  This is hugely important for your users when it comes to fixing a problem in 5 minutes instead of 30!

Nagios

Nagios is very popular and can do just about everything, but it comes with several drawbacks.  For my proof of concept I extended Brian Goff's docker-nagios image.

Pros:
  • Very popular so lots of support.
  • Huge number of features.
  • Good documentation.
Cons:
  • Steep learning curve due to the sheer number of features.  This applies to both navigating the UI and writing checks.
  • Creating check dependencies is cumbersome because you have to reference checks via their service_description field.  This means you either use the description like an ID (i.e. not a description) or you duplicate the description everywhere you reference (depend on) a check (see the sketch after this list).
  • Creating checks that run more frequently than once every 60 seconds comes with a "proceed at your own risk" disclaimer in the documentation: "I have not really tested other values for this variable, so proceed at your own risk if you decide to do so!"  See here.
  • The UI feels old (at least it does to me).  I remember being very frustrated by the use of frames for the dashboard, which means it's hard to send people links, and hitting F5 to refresh takes you back to the homepage.
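To illustrate the dependency point, here's roughly what that looks like in Nagios object config (the host names, service descriptions and commands are hypothetical) - note how the full service_description string has to be repeated in the dependency definition:

# the "master" service
define service {
    use                 generic-service
    host_name           db01
    service_description Database TCP connectivity on port 5432
    check_command       check_tcp!5432
}

# the dependency references both services by their descriptions
define servicedependency {
    host_name                       db01
    service_description             Database TCP connectivity on port 5432
    dependent_host_name             app01
    dependent_service_description   Application HTTP health check
    notification_failure_criteria   w,u,c
}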

 

Sensu 

Sensu is a lightweight framework that's simple to extend and use.  I used Hiroaki Sano's sensu docker image to get my proof of concept up and running.

Pros:
  • Lightweight and simple to extend and use.
Cons:
  • The UI has a feature called "stash" which I don't understand and which doesn't seem to be documented.
  • Dependencies can be configured, however they seem to have little effect on the dashboard... see the issue I raised on github here.
  • The documentation is great as a beginner's guide and walkthrough - I learnt the basics very quickly.  However, I soon found I needed a specification page detailing exactly what the check JSON could and could not contain.  This will hopefully be fixed soon: https://github.com/sensu/sensu-docs/issues/192
  • Could not associate descriptive text with my checks, e.g. "This checks for connectivity to the database which is required for..." (a minimal check definition is sketched after this list).
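For context, the checks I was writing were plain JSON definitions along these lines (the check name, command and subscriber below are made up for illustration); as noted above, I couldn't find an obvious field for a human-readable description:

{
  "checks": {
    "check_db_connectivity": {
      "command": "check-db-connection.rb -h localhost -p 5432",
      "subscribers": ["application"],
      "interval": 60,
      "handlers": ["default"]
    }
  }
}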

Icinga2

Originally a fork of the Nagios project (and now a complete re-write), this framework has a huge number of features and a good-looking dashboard.  I used Jordan Jethwa's icinga2 docker image.

Pros:
  • Huge number of features (on a par with Nagios).
  • Good-looking dashboard that feels more responsive than the Nagios UI.
  • Good support for check dependencies.

Cons:
  • I found the documentation hard to understand at first - this is related to the steep learning curve.
  • Notes / free text can be assigned to alerts, but the dashboard only seems to present them at quite a low level.  I couldn't find a way of customising the dashboard to display my "notes" prominently - perhaps via some feature I missed or one that is undocumented.  (Attaching the notes themselves is easy enough - see the sketch after this list.)
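For reference, this is roughly how a note is attached to a check in Icinga2's config DSL (the host, port and wording are hypothetical); my problem was surfacing the text in the dashboard, not defining it:

object Service "db-connectivity" {
  host_name = "db01"
  check_command = "tcp"
  vars.tcp_port = 5432
  // free text shown in the service's detail view
  notes = "If this check fails it means the application cannot reach the database."
}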
 

Chosen Framework: icinga2

Sensu was sadly discarded because we felt there was a risk it would not scale in terms of handling the future complexity of the system.  This was mainly due to the apparent lack of support for dependencies.  The gaps in the documentation also made it feel like it wasn't quite ready to be adopted.  I'm hopeful for Sensu's future and look forward to seeing how it develops... it's definitely one to watch.

Nagios vs Icinga2.
They both have:
  • compatibility with the same set of (Nagios) plugins.
  • lots of features at the cost of a steep learning curve.
  • support for check dependencies (so both should scale well in terms of complexity).
The differences:
  • icinga2 has a nicer UI - it feels more responsive.
  • icinga2 lets you create objects and their relationships dynamically using conditionals, which (I think) should result in less boilerplate and copy-pasted config than I have seen with nagios in the past (see the sketch below).
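As a rough illustration of what I mean by dynamic object creation (the "role" host variable and its value are my own hypothetical convention), an apply rule like this generates one service per matching host instead of each being declared by hand:

apply Service "http" {
  import "generic-service"
  check_command = "http"
  // a Service object is generated for every host matching this condition
  assign where host.vars.role == "webserver"
}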
Hopefully this will prove a good decision for the project!

Docker is great for these investigations

Being able to run someone else's docker image and get an entire monitoring framework up, running and accessible in your browser in less than a minute is amazing.  Not only that, you can easily add and edit things where necessary to get a feel for development.  It removes all the unwanted complexity of following setup guides.  Thanks to Brian Goff, Jordan Jethwa and Hiroaki Sano for their docker images.
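As a rough example, getting the icinga2 proof of concept running was a one-liner along these lines (the image name and port mapping are from memory - check the image's README before relying on them):

# run the container in the background and map the web UI to localhost:8080
docker run -d -p 8080:80 jordan/icinga2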

If I missed anything...

Let me know.  I only had limited time and may well have missed some killer feature or another monitoring framework entirely that's way better than all of the above.

Thursday, 19 March 2015

Ansible - Sharing Inventories

Recently I needed to share an Ansible inventory with multiple playbooks in different code repos.  I couldn't find a way to do this nicely without writing some new code which I'll share here.

Trisha Gee sums up how I feel about this: "I'm doing something that no-one else seems to be doing (or talking about). That either means I'm doing something cool, or something stupid."  Perhaps I missed an ansible feature - if so, please let me know!  I am new to it.  If not, perhaps ansible could add support for this.

EDIT 19/7/2015

I'm now of the opinion that this is a bad idea.  Regarding the Trisha Gee quote above - I now know what it was I was doing (hint... not cool).  

I am no longer sharing inventories, as all of our ansible deployment config (except a few roles) is now in one git repo.  This same repo holds our inventories, so there's no need for sharing.  If I did have to share them, I'd use git submodules (sketched below).
The solution detailed here is not a good idea, but I'll leave it here anyway!
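For reference, pulling a shared inventory repo in as a submodule would look roughly like this (the repo URL is hypothetical, borrowed from the requirements.yml example further down):

# add the shared repo as a submodule in an "inventories" directory
git submodule add ssh://git@github.com/company/ansible-shared-inventory-role.git inventories
# fetch/update it after cloning the playbook repo
git submodule update --init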

EDIT 30/3/2015

Git submodules offer a better alternative... http://git-scm.com/book/en/v2/Git-Tools-Submodules
I'm not entirely sure (I've only just learnt about this) but thought I'd share.


Why do I need to share inventories?

 

We currently describe our architecture using static inventories - one file per environment.  As per the examples in the ansible docs, we define groups like [databases], [webservers] and [applications].  I want all of this in one file and one code repo so that it's all defined in one place.  The project that builds and deploys the web layer is separate from the one for the application layer.  I did consider separating inventories by environment AND layer, e.g. test-web-inventory and test-application-inventory.  However, because the web nodes need to know about the application nodes (and likewise the application nodes about the databases), that would lead to duplicated data.
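To make that concrete, a per-environment inventory file looks roughly like this (the hostnames are made up):

# test-inventory
[webservers]
web01.test.internal

[applications]
app01.test.internal
app02.test.internal

[databases]
db01.test.internal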

Why not use a dynamic inventory?

 

It feels like too much, too soon.  The architecture is currently quite simple and I don't want to introduce more complexity (and dependencies) into the build and deploy pipeline until I have to.  A flat file is reassuringly simple.  Once the architecture grows and the flat file becomes untenable, it will be time for a dynamic inventory.

How I shared inventories

 

Summary

 

I put the inventory files within a shared role that gets assigned to localhost only, by the specific playbooks that need it.  That way, once we have installed the playbook's requirements (using ansible-galaxy install), we have the pinned version of the inventory files on the file system.  Then we can run our playbooks referencing the freshly retrieved files.

That might sound ugly, but it was that or custom bash scripts that check things out of git.

In more detail

 

The shared-inventory-role I created looks like this:

├── files
│   ├── dev-inventory
│   ├── test-inventory
│   └── prod-inventory
├── meta
│   └── main.yml (standard role metadata)
└── tasks
    └── main.yml (do-nothing task - maybe not required)
Each *-inventory file contains the full inventory that describes its environment.  This role lives in its own git repo.

The playbooks that import this role (e.g. the playbook for the application servers) do so by referencing it in their requirements.yml file as follows:

- src: ssh://git@github.com/company/ansible-shared-inventory-role.git
  scm: git
  version: ac1c49302dffb8b7d261df1c9199815a9590c480
  path: build
  name: shared-inventory

(Note that I can pin a specific version - in this case a git commit hash.)

Then, when we come to running the playbook, we simply install the playbook's requirements first as follows:

ansible-galaxy install --force -r requirements.yml

This forcefully installs the requirements (i.e. overwrites whatever is currently there) into the build directory.

Then we can apply our playbook with our inventory as follows:

ansible-playbook site.yml -i build/shared-inventory/files/test-inventory


Thoughts?

  • Have I missed a simpler way to do this?
  • Is this a feature that could/should be added to ansible?