Tuesday, 30 August 2016

How to prove your Ratpack app is non-blocking.

Using the Ratpack framework can be tricky. It's easy to get something working but it's harder to ensure that your code is non-blocking.  This post will detail how to prove your code is non-blocking with automated tests.

Background - What's the problem?

In order to realise the performance benefits of the Ratpack framework you have to ensure that your code is non-blocking.  This means ensuring that your Java Threads are always busy and never waiting for I/O.  As soon as your thread kicks off an I/O task (e.g. sending an http request or making a DB call), it should be able to switch to another task.  Once the I/O eventually completes, an event is fired which signals that the next step in the code can be performed, typically doing something with the result of the I/O task.  This means that a huge number of tasks which are I/O bound (i.e. spend a significant amount of time waiting for I/O) can be performed in parallel with only a small number of threads.  This approach can have significant CPU and memory benefits.  However, it comes at the price of increased code complexity.


In my opinion, the biggest problem is where you accidentally make your code blocking, i.e. the Thread is blocked waiting for the I/O to finish and doesn't do anything useful in the meantime.  Just because you are using asynchronous libraries, there's nothing stopping you from calling a blocking method by mistake.  In this situation you go back to the traditional synchronous model whereby you can only hope to achieve more parallel tasks being executed by adding more threads to your application.  What makes this situation worse with a Ratpack application is that you have invariably assumed non-blocking code and have only allocated a small number of threads to your application (the default is two threads per CPU core).  This can result in a drastic decrease in throughput.


These bugs can easily sneak through into production for several reasons.  Firstly, you can make your code inadvertently block a thread for a variety of very subtle reasons.  You don't even need to call any blocking methods yourself to introduce blocking code.  As an example, see this code which calls Cassandra - it makes any thread which subscribes to the returned Observable block on I/O.  Secondly, unless under load, the offending blocking code will not only work fine when invoked from tests but will respond with the same latency as if it were non-blocking.  In this scenario, only performance tests have the potential to stop your bug going live.
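
To illustrate how subtle this can be, the linked Cassandra example follows a pattern along the lines of the sketch below (the class, query and method names here are my own assumptions, not the actual code).  Observable.from(Future) looks asynchronous, but it calls future.get() when subscribed to:

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Session;
import rx.Observable;

public class CassandraService {

    private final Session session;

    public CassandraService(Session session) {
        this.session = session;
    }

    // Looks non-blocking since it uses the driver's async method, but
    // Observable.from(Future) blocks the subscribing thread on future.get()
    public Observable<ResultSet> getRows() {
        ResultSetFuture resultSetFuture = session.executeAsync("SELECT * FROM my_table");
        return Observable.from(resultSetFuture);
    }
}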


How to prove your code is non-blocking with tests


In short: demonstrate that your code can handle more I/O bound transactions concurrently than it has threads. Let's assume a web application that uses standard blocking I/O, has two threads, and performs the following:
  1. Takes a http request from a client.
  2. Makes a call to a downstream service over http which takes three seconds to respond.
  3. Once the response has been received from the downstream service, returns a response to the client.
Assuming a blocking model, the absolute best throughput that could ever be achieved is as follows:

(Time Period / (I/O Delay)) x (Number Of Threads) = Maximum throughput

Or in the case above:

((60 seconds) / (3 seconds for I/O to complete)) x (2 Threads) = 40 Transactions per minute (TPM)

In practice, the throughput would not be quite this high, since additional time would be spent dealing with the client's request and response.

When we consider a non-blocking I/O model, the above formula should no longer hold true since we are no longer limited by the number of threads.  

It's up to our tests to prove that the above formula does not apply to our code.  This can be done by sending N requests at the code simultaneously and ensuring that the throughput exceeds the maximum that would be possible if the code were blocking.  If it doesn't exceed that throughput - fail the test!

Writing the test  


I'm a fan of writing end to end tests that prove your code can make actual requests over the network to stubbed dependencies.  Wiremock is a great tool for stubbing and mocking http services and is ideal for this test.  

We need several things in our test as detailed below:
  1. I/O delay - To simulate a delay in our downstream http service, we can use wiremock's fixed delay feature for simulating slow responses.
  2. Constrained number of threads - We will set ratpack to have one thread only.
  3. Send simultaneous requests to our web application.
  4. Timeout - We will set the timeout to the I/O delay multiplied by the number of simultaneous requests.  If this timeout is exceeded, we haven't proved our code is non-blocking.  This can be specified quite neatly in junit tests with an annotation.
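
To give a flavour of points one and two above, the setup might look something like the sketch below (the stub path, response body and handler are my assumptions; the real setup lives in the linked repo):

import static com.github.tomakehurst.wiremock.client.WireMock.*;

import ratpack.server.RatpackServer;

public class NonBlockingTestSetup {

    public static final int SLOW_ENDPOINT_DELAY = 3000;

    // 1. Wiremock stub: the downstream endpoint takes three seconds to respond
    public static void stubSlowDownstreamService() {
        stubFor(get(urlEqualTo("/downstream"))
                .willReturn(aResponse()
                        .withStatus(200)
                        .withBody("{\"content\":\"hello\"}")
                        .withFixedDelay(SLOW_ENDPOINT_DELAY)));
    }

    // 2. Constrain ratpack to a single event loop thread
    public static RatpackServer startSingleThreadedServer() throws Exception {
        return RatpackServer.start(server -> server
                .serverConfig(config -> config.threads(1))
                // placeholder handler: the real one makes a non-blocking
                // call to the stubbed downstream service
                .handlers(chain -> chain.get("happy", ctx -> ctx.render("content"))));
    }
}
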
The code:
@Test(timeout = SLOW_ENDPOINT_DELAY * NUMBER_OF_CALLS)
public void handlerIsNotBlocking() throws Exception {

    URI uri = new URI(getAddress().toString() + "happy");

    // Fire all requests at once; with only one ratpack thread, blocking code
    // would need roughly SLOW_ENDPOINT_DELAY * NUMBER_OF_CALLS to respond
    List<Response> responses = new ConcurrentExecutor(
            () -> jerseyClient().target(uri).request().get(), NUMBER_OF_CALLS).executeRequestsInParallel();

    assertThat(responses).hasSize(NUMBER_OF_CALLS);
    responses.forEach(this::verifyResponseHasCorrectContent);

}
 
See github for code.
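
For completeness, the ConcurrentExecutor used above could be implemented along these lines (a sketch based on its description; the real class is in the linked repo):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Supplier;

public class ConcurrentExecutor<T> {

    private final Supplier<T> request;
    private final int numberOfCalls;

    public ConcurrentExecutor(Supplier<T> request, int numberOfCalls) {
        this.request = request;
        this.numberOfCalls = numberOfCalls;
    }

    public List<T> executeRequestsInParallel() throws InterruptedException, ExecutionException {
        // One client thread per request so that all requests are in flight at once
        ExecutorService executor = Executors.newFixedThreadPool(numberOfCalls);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (int i = 0; i < numberOfCalls; i++) {
                futures.add(executor.submit(request::get));
            }
            List<T> responses = new ArrayList<>();
            for (Future<T> future : futures) {
                responses.add(future.get());
            }
            return responses;
        } finally {
            executor.shutdown();
        }
    }
}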


Results


When this test is run, the following happens:
  1. Eight requests are sent simultaneously to the ratpack application via a "ConcurrentExecutor", a class I created which wraps an ExecutorService. 
  2. Eight requests are sent to wiremock. 
  3. After a three second delay, all eight responses are received from wiremock. 
  4. Ratpack then returns all eight responses within around ten milliseconds of each other. 
  5. The junit test verifies that all eight responses were received ok. 
  6. The test completes quicker than the timeout, proving that the code is non-blocking. 
This same technique can be applied to other services such as Cassandra using Stubbed Cassandra. 

Wednesday, 6 April 2016

The case for keeping firewalls simple

Internal firewall rules that attempt to analyse anything higher than the network layer can cause huge problems. In this post I'll make the case for keeping your firewall rules simple.


The problem we had


Our team recently encountered an error where an internal web application received a socket timeout when trying to call one of its internally hosted dependencies.  Whilst investigating, we found that the application had made successful HTTP calls to the same service immediately prior to the error.  



It was puzzling, but I ruled out anything network related in our investigation, given:

  • The application could make some requests absolutely fine.
  • There was seemingly nothing different about the requests and responses.  They were all GET requests that returned a small amount of JSON.
  • There weren't any connection errors.

All signs pointed to the server taking too long to respond, i.e. an application issue. We then found that the system being called had no record of even receiving the failed request. It was very puzzling.  One member of the team recommended we talk to Ops about the issue.  I was convinced we shouldn't bother them, as the problem must be with the application.  I was wrong!

We then found out (through talking to Ops) that the firewall was blocking the request on the basis that its contents were deemed suspect.

We wasted a lot of time investigating completely incorrect theories based on seemingly sound, but invalid assumptions.



The problem in general


How to ever know if an error is firewall related


Our problem manifested itself as a socket timeout.  How would other, non-http protocols report blocked traffic?  I can easily imagine going through the same long learning process for a database, an FTP or an SMTP service.

Confusion is introduced - even if it never fails again


Let's assume that these problems are addressed and the firewall's logic is updated to allow the legitimate requests.  Let's also suppose that the firewall never blocks a genuine request again.  When a socket timeout error is encountered, we could now point the finger at the firewall when we should be focusing on the application.

Recommendation


I'd advocate a simple firewall for internal traffic that whitelists IPs and ports only.  If we can be certain that the firewall completely trusts traffic based on an established TCP/IP connection, things will become a lot simpler to debug.  

The simpler things are to debug, the quicker you can fix your site in an emergency!  Time is of the essence.

If you must...


If there is an absolute requirement that these firewall rules are in place, confusion can be mitigated by performing the following:

  • Ensure that all developers are aware of how firewall issues may present themselves.
  • Provide a console for everyone to easily see if traffic is being blocked by the firewall or not.

Tuesday, 8 March 2016

Ratpack talk - The story so far

I recently gave a talk on the subject of our team's experience with Ratpack to developers at Sky, where I am currently working with Energized Work.



Apologies for the technical issue half way through (and also for the huge number of erms and ers). 

Slides: http://www.slideshare.net/PhillBarber/ratpack-the-story-so-far




Thanks to everyone who came along at Osterley.





...thanks also to the people watching from the Leeds office.

Tuesday, 23 February 2016

Lesson learned with Ratpack and RxJava

We recently encountered an issue with integrating Ratpack with RxJava's Observables.  This post will detail what the error was, how the code was broken and how it was fixed.

The code 

Essentially, our handler code was as follows (NOTE: see git project for this class and also a test showing the behaviour of the broken code):

@Override
public void handle(Context context) throws Exception {

    Observable<String> contentFromDownstreamSystem = observableOnDifferentThreadService.getContent();

    // Broken: ratpack is given no reference to the work this subscription kicks off
    contentFromDownstreamSystem.subscribe(response ->
            context.render("Downstream system returned: " + response)
    );
}


The code above retrieves an Observable from a service.  As implied by the variable name, the Observable emits items on a different thread when it is subscribed to.  The Observable I have used to illustrate this issue is very simple and can be found here.  In our real world example, our Observable represented a ResultSetFuture from Cassandra which would emit items after a (very short) period of time.  
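
For reference, the service used in this example does something along these lines (a simplified sketch; the real class is linked above):

import rx.Observable;

public class ObservableOnDifferentThreadService {

    // Emits a single item from a newly spawned thread, i.e. not the
    // ratpack compute thread that subscribes to it
    public Observable<String> getContent() {
        return Observable.create(subscriber ->
                new Thread(() -> {
                    subscriber.onNext("Some content");
                    subscriber.onCompleted();
                }).start());
    }
}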

The errors


No response error


The first error we encountered was as follows:

[2016-02-17 08:39:16,812] ratpack-demo WARN  [ratpack-compute-1-2] r.s.i.NettyHandlerAdapter - No response sent for GET request to /observable-different-thread-broken (last handler: com.github.phillbarber.scenario.observablethread.ObservableOnDifferentThreadHandlerBroken)


Despite the fact that our Observable's items would take some time to be emitted, the above error occurred seemingly immediately after our handler had completed.  This also resulted in an http 500 error response being issued to the client.

Double transmission error


It gets better!  Not only did we get an error indicating that no response had been sent, we then saw an error implying we had tried to send two responses:

[2016-02-17 08:39:16,821] ratpack-demo WARN  [Thread-4] r.s.i.DefaultResponseTransmitter - attempt at double transmission for: /observable-different-thread-broken
ratpack.handling.internal.DoubleTransmissionException: attempt at double transmission for: /observable-different-thread-broken

The Problem


The problem here is that Ratpack is not aware that the request is dependent on the Observable emitting items.  In other words, the request's Execution does not contain a reference to the Execution segment which represents the Observable's success action (the lambda passed to the subscribe method).  Since it seems to ratpack that there is no further work to do, ratpack's NettyHandlerAdapter detects that no response has been sent, logs the "No response sent for request" warning and issues an error response to the client.

The final twist is that our Observable's action eventually completes.  When it tries to write a response, it can't, as the response for the request has already been committed.  This is why we get the "double transmission" error.


The fix - Convert your Observable to a Ratpack Promise


We need to ensure that the request's Execution has a reference to the Execution segment of our success action.   This is done by converting the Observable to a Ratpack Promise and activating it as follows (see fixed code in git here and a test here):

@Override
public void handle(Context context) throws Exception {

    Observable<String> contentFromDownstreamSystem = observableOnDifferentThreadService.getContent();

    // Converting to a Promise adds the success action to the request's Execution
    RxRatpack.promise(contentFromDownstreamSystem).then(response ->
            context.render("Downstream system returned: " + response)
    );
}


If you read the ratpack documentation, this will seem obvious and you might wonder why we made this mistake in the first place.  However, we were tricked into thinking the broken code would work because there are some very subtle ways in which it actually can work.  The broken code will work just fine under the following scenarios:

  1. If the Observable returned by the service was converted from a Promise, e.g. a Promise returned by the Ratpack httpclient.  That way the Promise will have been activated (or rather, the execution segment added to the execution) indirectly by some other code and not explicitly by the Handler (see the sketch after this list).
  2. The Observable synchronously emits items on the same thread.  I'm not sure of a real life reason why you'd do this, but it can occur during your testing when mocking and stubbing.
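
As an example of the first scenario, a handler along the lines of the sketch below happens to work when subscribing directly, because the Observable was converted from a ratpack Promise (the URI and wiring here are assumed for illustration):

import ratpack.handling.Context;
import ratpack.handling.Handler;
import ratpack.http.client.HttpClient;
import ratpack.rx.RxRatpack;
import rx.Observable;

import java.net.URI;

public class PromiseBackedObservableHandler implements Handler {

    @Override
    public void handle(Context context) throws Exception {
        HttpClient httpClient = context.get(HttpClient.class);

        // The Promise from ratpack's http client registers the execution
        // segment itself, so a direct subscribe happens to work here
        Observable<String> contentFromDownstreamSystem =
                RxRatpack.observe(httpClient.get(new URI("http://downstream-service/content")))
                        .map(receivedResponse -> receivedResponse.getBody().getText());

        contentFromDownstreamSystem.subscribe(response ->
                context.render("Downstream system returned: " + response));
    }
}
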
Our conclusion is that when using RxJava with ratpack you should always convert to a Promise in your handler layer.  You should do this even if you don't need to (as described by points one and two above) so as to play it safe in case the implementation of the Observable changes in the future.  

Why are we using RxJava


When we decided to use Ratpack, we wanted to avoid depending on it throughout our entire codebase.  If all of our services returned Ratpack Promises, we'd have an even bigger job on our hands if we decided to switch frameworks.  It was hoped that using RxJava would decouple most of our code from Ratpack.  

Even with the extra learning curve of using RxJava, this seems reasonable, as typically a web framework will only be referenced from your front end, web/controller layer. It seems a bit of an anti pattern to depend on it throughout the entire code base.

Summary


You only understand how things work when things go wrong.  This was a great problem for us as a team to figure out since it taught us about the intricacies of how Ratpack actually works.

  

Sunday, 10 January 2016

Choosing between Ratpack and Dropwizard.


This post will discuss our team’s approach to evaluating the java based web framework ratpack as an alternative to dropwizard for creating a new microservice.  This post assumes a basic prior knowledge of both ratpack and dropwizard.

Background


The team I'm on is full of developers, all with experience of creating and maintaining applications built with the Dropwizard framework. It’s safe to say that Dropwizard would be the team’s default preference when considering frameworks for web based java apps. However, we have recently been tasked with building a new microservice that is as efficient as possible. Although this non-functional requirement isn’t 100% specified, we do know it will need to be capable of high throughput while using few resources (i.e. memory and CPU). High costs from the company’s PaaS provider have no doubt influenced this requirement.

Ratpack is said to allow you to create "high performance" services that are capable of meeting these non-functional requirements assuming that your application is I/O bound.  It's because of this that the team evaluated it.

Dropwizard vs Ratpack performance


A Dropwizard vs Ratpack performance test was conducted to see if the reputed performance benefits could be observed for our use case.

Two webapps that performed the same steps (detailed below) were created, one with dropwizard and the other with ratpack.

Architecture Diagram




Load Profiles  


Three load profiles were given to each application as detailed below:
  1. 5 minute duration, 100 JMeter Threads
  2. 5 minute duration, 200 JMeter Threads
  3. 5 minute duration, 300 JMeter Threads

Results


The third test run resulted in the Dropwizard application eventually failing to respond.  


The graph above shows that throughput for the first two test runs was very close.  The third test run shows Ratpack answering significantly more requests per second due to the fact that the Dropwizard application began failing to respond.  Exactly why the Dropwizard application gave up was not investigated. 





Here we see ratpack using significantly less memory in all but the last test run. 






Ratpack seemed to use around half as much CPU as Dropwizard.




Here we see the true nature of our non-blocking frameworks in action.  Ratpack used just twenty threads in each test run.  Knowing the basic premise of non-blocking I/O, this shouldn't come as a surprise - however I still find this very impressive!



Performance Summary

Ratpack can clearly handle more requests with fewer resources in an application that spends most of its time waiting for I/O.  This was of major significance to us since we knew from the start our application would be I/O bound.  It's also worth pointing out that (at least in my experience) I/O bound webapps are very common.

Other concerns


In order for the organisation to support our new service, it needs to comply with a few standards.  

Logging

A common design pattern in a microservices architecture is to assign some kind of UUID to each request which is then passed to downstream systems.  This UUID can then be added to log events so that they can be correlated across multiple systems.  This can (and has at our organisation) been implemented using SLF4J's Mapped Diagnostic Context (MDC) which "...manages data on a per thread basis."  However, it should be noted that "a server that recycles threads might lead to false information".
Ratpack recycles threads, but luckily we're covered: an MDCInterceptor has been created which addresses this very problem.
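
Registering it is close to a one-liner in the server definition, something like this sketch (based on my reading of the ratpack API; the handler is a placeholder):

import ratpack.logging.MDCInterceptor;
import ratpack.server.RatpackServer;

public class ServerWithMdc {

    public static void main(String[] args) throws Exception {
        RatpackServer.start(server -> server
                // The interceptor reinstates the MDC for each execution segment,
                // so log correlation survives thread recycling
                .registryOf(registry -> registry.add(MDCInterceptor.instance()))
                .handlers(chain -> chain.get(ctx -> ctx.render("OK"))));
    }
}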

Deployment


The organisation relies on the simplicity that comes with building and deploying Java based services as Uber/Shaded/Runnable jars.  Since ratpack applications can be built in just the same way, this wasn't an issue.

Integration with Hystrix


Hystrix is a great library that helps you make your services fault tolerant.  Not only is it great, it's become the unofficial standard within our organisation.  Any java based application can integrate with Hystrix in many different ways since it's so flexible.  Hystrix supports synchronous, asynchronous and reactive programming models.  Not only does Hystrix support non-blocking calls, it also supports the use of semaphores instead of Thread pools to manage concurrent downstream calls.  The flexibility and support offered here by Hystrix is absolutely crucial!  If Hystrix mandated the use of a thread pool or blocking calls, we'd be back to the Dropwizard performance characteristics (shown above) of having one thread per concurrent request we handle (assuming each request results in a downstream system call, which is true in our case).  That would negate ratpack's benefits almost entirely.

Not only is it possible to use Hystrix with ratpack, it's also easy to use features such as request caching, request collapsing and streaming metrics to the dashboard (which is awesome by the way) with the ratpack-hystrix JAR.
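
As a sketch of what this can look like (the names and wiring here are my assumptions), a non-blocking downstream call can be wrapped in a HystrixObservableCommand, which uses semaphore isolation rather than a thread pool by default:

import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixObservableCommand;
import ratpack.http.client.HttpClient;
import ratpack.rx.RxRatpack;
import rx.Observable;

import java.net.URI;

public class DownstreamContentCommand extends HystrixObservableCommand<String> {

    private final HttpClient httpClient;
    private final URI uri;

    public DownstreamContentCommand(HttpClient httpClient, URI uri) {
        super(HystrixCommandGroupKey.Factory.asKey("downstream-service"));
        this.httpClient = httpClient;
        this.uri = uri;
    }

    @Override
    protected Observable<String> construct() {
        // Non-blocking call via ratpack's http client; Hystrix guards it
        // with a semaphore, so no extra thread pool is needed
        return RxRatpack.observe(httpClient.get(uri))
                .map(receivedResponse -> receivedResponse.getBody().getText());
    }
}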

Complexity (Learning Curve)


The benefits of ratpack come at the cost of complexity.  Our team (myself very much included) are used to building traditional synchronous based Java apps.  The effort of adopting an asynchronous programming model was difficult for the team to assess but was certainly a known risk which could affect delivery time.

Decision Time - Ratpack or Dropwizard


With the amazing performance from our benchmark and a sense of optimism that we’d pick up the new framework and asynchronous programming quickly... the decision to go with ratpack was made.



My next post will detail the team's experience in general with Ratpack. 

Thanks to the team for all of the shared learning so far, and especially to @mirkonasato for leading the performance benchmark work detailed above.

Saturday, 31 October 2015

Ansible's for loops

Recently, my team and I hit a big limitation with Ansible that really took me by surprise.  For loops can only iterate over a single task, and not roles or playbooks.  This post will discuss why it was a surprise, what we did to get around it and (briefly) what Ansible 2 introduces.

Pre Ansible 2


The original issue really took me by surprise since the documentation on for loops showed a huge number of possibilities (including conditionals, looping over arrays etc etc) that suggested to me it was Turing-complete and easily capable of meeting all our looping needs.  So when I needed to assign a role multiple times (dynamically from a list) I was shocked to find that it couldn't be done.

I'm not the only one who was caught out by this.  See here, here and here.

Why this wasn't supported really puzzled me.  Some on my team concluded that it must be due to the way Ansible works internally and would perhaps require a lot of re-writing to support the feature.  However, Michael DeHaan said here that there was a "fundamental reason" it couldn't be done:

"...the role must be applied in the same definition (but not the same variables) to each host in the host group. Ergo, you can't have different numbers of roles applied to different hosts in the same group based on some kind of inventory variance."

But I don't understand why you can't.  The hack here suggests to me that you can.

Should the docs have told me what's not possible?


I also wondered if the ansible documentation on loops should have mentioned its limitations.  Funnily enough, Michael DeHaan wrote an article on Technical Documentation As Marketing which said "your documentation and front material is there to both be honest, communicate how to use something, and also sell what you are doing".  The docs certainly didn't lie to me, but they could have been more honest about their limitations.

Workaround

We looked at a few options:

  1. Hack - Way too ugly!
  2. Loop over each task within a role separately - This was not possible for us since some tasks depended on output from previous tasks. 
  3. Convert role (or set of tasks) into a single task - No appetite to learn and code more ansible specific stuff at a time when we felt so let down by it!
  4. Move logic away from ansible into something external that ansible calls (e.g. a bash script).  
We opted for the least bad, which to us was option four.  


Post Ansible 2

According to slide 15 here, in Ansible 2 you can loop over multiple tasks which is really good to hear.  However, it seems you can only loop over tasks, and not roles.  I don't understand why you can't loop over roles, but am really encouraged to see the feature included.  It's also great to hear that Ansible 2 is backwards compatible with version 1.  I look forward to using it. 


Tuesday, 15 September 2015

A Service shouldn't know about its Environment


Recently I was involved with setting up a build and deploy system for a suite of microservices. Each service was bundled into its own Docker image and deployed by Ansible.  One of the biggest lessons I learnt is the subject of this post.

Background


  • Project encompasses around twenty microservices (this figure only set to increase).
  • All packaged and deployed as Docker images.
  • Deployment by Ansible.   
  • Some services need to be deployed to different machines on different networks since some are public facing, some backend, some internal etc etc.
  • Different environment variables need to be injected into different running Docker containers.
  • Each service lives in its own git repo.
  • Each service built by CI server which runs project tests and creates a Docker image. 
  • Big effort to minimise the amount of work/config required to create a new microservice.

The Problem


Where to store the ansible deployment configuration for each service.

 

First idea: Each service owns its deployment configuration.


At first we had the idea of each service being in control of its own deployment destiny.  This involved each project storing its ansible config in its own git repo.  I liked this since it meant that everything you needed to know about a service was in one place.  It promised to be great for auditing and keeping track of changes, since you only needed to worry about one project. 

It all sounded great in theory. However, we soon discovered it was impractical for a number of reasons.

Making changes to deployment config is hard when it's defined in multiple repos.


The first sign this was a bad idea was that we found ourselves having to make changes in many different projects when we wanted to change the deployment mechanism. For example, if we wanted to run a command after every deploy, this would require changing the common Ansible role responsible for deployment and then updating each service to use the new version of that role. This would then require us to rebuild each project, with the pointless step of building and tagging a new Docker image that was identical to the last (since no change to the Ansible deployment config affected the Docker image for a given service). With the number of microservices set to increase, this problem was only forecast to get worse.

There's no central place which describes your environment.


Answering the question "What else runs on network/machine X?" becomes impossible as you have to check an ever growing number of services' repos.

Dependency is the wrong way around


If each project has its own deployment config, it is in effect describing a subset of the environment.  In our project, this seemed both impractical and inflexible.  In retrospect, a service knowing about its environment seems like a dependency the wrong way round. To explain this, consider the following relationships between the entities composing a typical Dockerized Java Web App.



The diagram above shows the relationships between various entities in two projects.  Each arrow can be read as "depends on".  This diagram also shows that the "Environment" can't be drawn as a single entity in its own right as it's distributed among multiple projects.

Looking at the above from right to left:
  • A library is imported by a webapp. 
  • A webapp is built into a Docker image.
  • The Ansible config deploys Docker Containers of specified Docker images.
  • The Ansible config describes a subset of the environment. 
The arrows that point from the Ansible config to the Environment feel out of place and wrong!  In fact, if any of the above left to right pointing arrows were reversed, we'd have a design problem:

  • If a library knew about the webapp using it, it wouldn't be reusable. 
  • If the webapp knew it was deployed in Docker, it would require re-work to deploy it outside of Docker. 
Since the Project has Ansible config that defines the Environment, it means that the Project is coupled to its environment and deployment implementation.  The same could be said for the project's relationship with Docker.  The webapp is coupled to Docker via its corresponding Dockerfile, which means replacing Docker with an alternative could be arduous.  However, this coupling seemed OK to us at the time since the Environment was in much more of a state of flux than our Docker containers.

Second idea: Deployment config in its own project


With this idea, we have each project only responsible for defining its Docker image and not how it's deployed.  No Ansible config lives in any microservice repo; it all lives in another project which is used to deploy everything.

The diagram can now be re-drawn as follows:



Here we see the Environment as a fully fledged entity with its own git repo.  The project knows nothing about where or how it's deployed.  The project is only responsible for building Docker images.

After switching from the first idea to the second, we quickly found that this made more sense from a conceptual level and was easier practically.

Why this was better for us:



  • The Environment config could be understood in full by looking at one repo.
  • All Ansible roles that we had created were then moved into this one repo which also helped with exploring the config.
  • Making changes to the deployment config was far easier being in one place.
  • When changing the config, it was less likely to require effort that was proportional to the number of microservices.