Notes and thoughts from Ansible meetup - July 2015

Here are my notes and thoughts on the Ansible meetup hosted by Equal Experts on 16th July 2015.  I benefited hugely from the event, since I got to hear about a technology that I thought I knew reasonably well being used in a completely different way than I had ever considered.

Talk 1: Debugging custom Ansible modules - Alex Leonhardt

 

Unfortunately I missed this talk as I left work late.
His slides can be found here: http://www.slideshare.net/aleonhardt/debugging-ansible-modules

Talk 2: AWS, CloudFormation, CodeDeploy, and some Ansible fun - Oskar Pearson


Oskar Pearson talked to us about how the deployment of the E-Petitions website works.  The entire deployment config has been open sourced on GitHub, so you can work out exactly how it has been done.  Oskar's full presentation is also available, but a very brief summary is shown below so that my thoughts have some background context.

Problem Domain:
  • Busy site.
  • No permanent Ops team.
  • Security and uptime are important.
  • Cost is a factor.
  • Went with AWS.
Principles:
  • AWS isn't a hosting provider [it's a platform] - so do things the AWS way to get the maximum benefit.
  • Disposable servers.
  • Tried to avoid building things themselves.
Architecture details:
  • ELB in front of webservers.
  • Webservers in auto-scaling group.
  • S3 storage used for code deploys.
Deployment details:
  • Ansible is the glue that holds everything together...  It builds servers, configures services (e.g. NGINX) and sets up environment variables. 
  • Releases are done by copying a tar file to S3 and then invoking CodeDeploy to deploy it (sketched below).
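To make that last step concrete, here is a rough sketch of what the release glue might look like as Ansible tasks.  This is purely my own illustration rather than the E-Petitions playbook: the bucket, application and deployment-group names are made up, and it assumes the AWS CLI is installed with credentials configured...

- name: upload the release bundle to S3 (bucket and file names are hypothetical)
  command: >
    aws s3 cp petitions-release.tar.gz
    s3://my-deploy-bucket/releases/petitions-release.tar.gz

- name: ask CodeDeploy to roll the bundle out to the auto-scaling group
  command: >
    aws deploy create-deployment
    --application-name my-app
    --deployment-group-name production
    --s3-location bucket=my-deploy-bucket,key=releases/petitions-release.tar.gz,bundleType=tgz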

What I found most interesting... Ansible with Disposable Servers

 

I was already familiar with the concept of disposable servers (also known as phoenix servers or immutable servers) and love the benefits they bring.  I have worked in an environment with snowflake servers in the past and seen all of the terrible problems they introduce.  That is why I became such a huge fan of Docker when I started working with it.

However, what I didn't appreciate (until Oskar's talk) was that it is both possible and practical to use tools such as Ansible in this way, provided the following statement holds true: Ansible only ever runs once on a given VM.  If this always holds true, you get all of the awesome benefits of disposable servers.  In the case of the E-Petitions setup, this statement does hold true.
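To show what "runs once" can look like in practice, here is one possible pattern (my own sketch, not necessarily how E-Petitions does it): each new VM in the auto-scaling group provisions itself on first boot via cloud-init user data, the playbook is never applied to that VM again, and any further change means replacing the VM.  The repository URL and playbook name below are made up...

#cloud-config
# Hypothetical first-boot provisioning: install Ansible and apply the
# playbook exactly once; cloud-init's runcmd only runs on the first boot.
packages:
  - ansible
runcmd:
  - ansible-pull --url https://github.com/example/server-config.git --directory /opt/server-config site.yml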

Talk 3: Ask Ansible - Mark Phillips 

 

Mark Phillips then took the floor to field questions from people using Ansible.  Not only did Mark clearly have a huge amount of Ansible knowledge, but the room was also packed with other experienced Ansible users who chipped in when needed.

Do you need tests for Ansible roles?

 

First question..."Puppet has things like Beaker to help with testing.  How do you test your Ansible config?".  Mark's opinion was that testing is achieved by simply applying the config and seeing if it works.  Mark went on to argue that testing roles (or some other config in isolation) has limited value, since you won't know whether the entire setup works until you apply it for real, to a real (test) machine.

I can understand this view and definitely recommend that end-to-end testing is performed, but I'm of the opinion that more is sometimes needed.

If an Ansible role is responsible for just installing a service, it's tempting to agree with Mark... it either works or it doesn't, and if it doesn't it should error in an immediately obvious way.  However, I think that Ansible roles (like anything in IT) can sometimes fail in nasty, subtle ways that can go undetected for a long time.  The earlier these issues are detected, the less damage they cause.  Also, the quicker they are identified, the easier they are to understand and fix.  You are more likely to associate an issue with something you changed five minutes ago than with something from two weeks ago.

Ansible roles don't always obviously fail - they can fail silently



Suppose you have added the following line to a task...
ignore_errors: yes
 
... when you shouldn't have (perhaps due to a copy-paste mistake).  In this situation a failure might not be detected immediately.  Or perhaps the ignore_errors was there for a certain expected error, but the task actually fails for a completely different reason?  This has happened to me!  It's fair to argue that my role was coded badly (as my colleague delighted in informing me!)...  However, some tests for my role might have prevented this issue from escaping our QA process and lurking silently in production.
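This is exactly the kind of mistake a small, role-specific test could catch.  As a rough sketch (the role name, port and URL are made up), a test play could apply the role and then assert that the service it installs is actually up, rather than trusting the role's own exit status...

- hosts: test_host
  roles:
    - my_service          # hypothetical role under test
  post_tasks:
    - name: check the service is actually listening, whatever ignore_errors swallowed
      wait_for:
        port: 8080
        timeout: 10

    - name: check the service answers requests successfully
      uri:
        url: http://localhost:8080/health
        status_code: 200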

In the absence of specific tests for an Ansible role, you have to rely on other tests catching your mistakes.  If these tests are reliable and honoured (i.e. no one deploys further up the pipeline if the tests are red), then I think that's OK.  However, I'd still argue for tests that target specific roles in order to quickly identify what is at fault.  Achieving this in practice is hard...

How to test Ansible roles (if you wanted to, that is)?


I actually wrote about this a while ago.  Essentially you can apply your Ansible role to a docker container.  This in theory should work fine.  Sadly I have found problems with this in practice.  Finding a docker container that mimics my target VM's distribution (CentOS 7) with systemd running is non-trivial.  To be honest I can't remember all the issues I had, but I remember pairing with a colleague and eventually (sadly) giving up.  But the theory's sound!  Perhaps the solution to my problem is to test my Ansible roles against something a little heavier than a docker container, like a Vagrant-provisioned VM.
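For reference, the basic mechanics of pointing a play at a throwaway container look something like the sketch below.  It uses the docker connection plugin that later Ansible versions ship, the container and role names are made up, and it does not solve the systemd problem described above...

- hosts: localhost
  gather_facts: no
  tasks:
    - name: start a throwaway CentOS 7 container to test against (assumes the docker CLI is installed)
      command: docker run -d --name role-under-test centos:7 sleep infinity

    - name: register the container as a test host via the docker connection plugin
      add_host:
        name: role-under-test
        ansible_connection: docker

- hosts: role-under-test
  roles:
    - my_service          # hypothetical role under test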



Ansible with strict SSH host key checking?

 

I then asked Mark about the practicalities of using Ansible with strict SSH host key checking.  Essentially I wanted Ansible to verify the SSH public key of a target host against a pre-defined value.  I wanted to know if there was a better way than my current solution, detailed below...

  • Store the SSH public key for each host in the inventory (I'm currently using flat-file inventories) like so....
[web_host]
10.1.1.1 public_ssh_rsa_key="....the-key..."
  • Run the following bash script, which will generate a file named known_ansible_hosts....
#!/bin/bash
set -euo pipefail
IFS=$'\n\t'

ANSIBLE_SSH_KNOWN_HOSTS_FILE="${HOME}/.ssh/known_ansible_hosts"
INVENTORY_DIR="./inventories"

# Pull every quoted public_ssh_rsa_key value out of the inventories and
# write one per line into the known hosts file that Ansible will use.
cat ${INVENTORY_DIR}/* | grep public_ssh_rsa_key | perl -ne 'm/"(.+)?"/ && print $1."\n" ' > "${ANSIBLE_SSH_KNOWN_HOSTS_FILE}"
  • Create an ansible_ssh.config file that references that file for its UserKnownHostsFile:
UserKnownHostsFile ~/.ssh/known_ansible_hosts
  • Create an ansible.cfg file which references the ansible_ssh.config file:
[ssh_connection]
ssh_args = -F ansible_ssh.config
 
...as you can see, it's a bit long-winded.

Answers to this question came fast... I'd guess about twenty brains were thinking of solutions and my one brain couldn't keep up with the proposals.

Oskar mentioned an Ansible role on his GitHub account which configures sshd on the destination server.  Although this is closely related, I don't think it solves quite the same issue: my problem is getting the client to trust the server, not the other way around.  Anyone who does have a better solution... please feel free to comment!
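One idea I'd like to try myself (untested, and not something suggested at the meetup) is replacing the bash script with Ansible's own known_hosts module, run against localhost before any real connection is made.  Assuming the public_ssh_rsa_key inventory variable already holds a full known_hosts line, as it effectively does in the grep approach above, a rough sketch...

- hosts: all
  gather_facts: no
  tasks:
    - name: add each host's pre-defined key to the known hosts file Ansible will use
      known_hosts:
        name: "{{ inventory_hostname }}"
        key: "{{ public_ssh_rsa_key }}"
        path: "~/.ssh/known_ansible_hosts"
      delegate_to: localhost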


Post Talk Conversations

Ansible's last job when setting up a VM - Disable SSH?  High security and the ultimate immutable server!


A small group of us got chatting with Oskar afterwards and (amongst other things) debated the possibility of Ansible disabling SSH as its last task.  This would in effect permanently restrict access to the VM and create the ultimate immutable server!  I think we all agreed that this would be awesome, but it would require a lot of faith in:

  • Monitoring - any live site issue would have to be debugged by your monitoring alone... no running telnet or curl from your live VM.
  • Build and deploy pipeline - You'll have to fix (or debug) the issue without manual changes, using only regular releases.
If disabling SSH sounds too risky for your setup (as it does for mine currently), then it could be used as extra motivation to improve the above list.  The next time I SSH onto a VM and do something manually, I will try to ask myself: "How can this be automated so I can switch off SSH access?"
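For what it's worth, the final task itself would be tiny; something like this sketch (my own illustration, nothing shown at the meetup).  It leans on the fact that stopping sshd normally leaves Ansible's already-established connection alive long enough for the play to finish...

- name: last task of the build - stop and disable sshd so no one (including us) can log in again
  service:
    name: sshd
    state: stopped
    enabled: no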


All in all, this was a fantastic meetup! I learnt a lot and it really got me thinking about improving my deployments.  Thanks again to all involved.
 
