Friday, 28 February 2014

Client Side vs Server Side Session

We recently looked at replacing our legacy session management system with a new one.  During this analysis, we came close to choosing a client side session but eventually concluded server side was better for us.  Here's why...

Client side session


In this model, all session state is stored in the client in a cookie.  The benefits are that you don't need to worry about persisting and replicating state across nodes, and session validation is lightning fast since you don't need to query any data store, which makes it highly scalable.  The session cookie must obviously be tamper-proof (to prevent people creating a session of their choice), which is achieved by signing the cookie using asymmetric cryptography.

The signing of a cookie value uses the private key, the validation uses the corresponding public key.  Our idea was to try to keep the private key as private as possible by storing it in memory only.  Each node (4 shown below) would create a new private/public key pair on startup, with the public key being persisted in a replicated data-store.  This means (providing replication is working) any node is able to validate a session created (signed) by any other node.
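
To make that concrete, here's a minimal sketch of the idea using nothing but the JDK (the class and method names are illustrative, not our production code).  The private key never leaves the node; the public key is what gets persisted and shared.

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;
import java.security.Signature;
import java.util.Base64;

public class CookieSigner {

   private final KeyPair keyPair;

   public CookieSigner() throws Exception {
      // Each node generates its own key pair on startup; only the public key
      // would be persisted to the replicated data-store.
      KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
      generator.initialize(2048);
      this.keyPair = generator.generateKeyPair();
   }

   public PublicKey publicKey() {
      return keyPair.getPublic();
   }

   // value is the serialised session state; returns "value.signature"
   public String sign(String value) throws Exception {
      Signature signer = Signature.getInstance("SHA256withRSA");
      signer.initSign(keyPair.getPrivate());
      signer.update(value.getBytes(StandardCharsets.UTF_8));
      String signature = Base64.getUrlEncoder().withoutPadding().encodeToString(signer.sign());
      return value + "." + signature;
   }

   // Any node (or any other system) holding the public key can validate a cookie
   // without calling the node that created it.
   public static boolean isValid(String cookie, PublicKey publicKey) throws Exception {
      int dot = cookie.lastIndexOf('.');
      if (dot < 0) {
         return false;
      }
      Signature verifier = Signature.getInstance("SHA256withRSA");
      verifier.initVerify(publicKey);
      verifier.update(cookie.substring(0, dot).getBytes(StandardCharsets.UTF_8));
      return verifier.verify(Base64.getUrlDecoder().decode(cookie.substring(dot + 1)));
   }
}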


Benefits of a client side session:

  • Low Latency - Validating and creating sessions is lightning fast as it doesn't need to hit the data-store.
  • Fewer points of failure - DB is only required on startup of each node.
  • Highly scalable since it's stateless (slight caveat of replicated public keys).
  • Possibility of distributing validation away from application nodes.  If the public keys are exposed via an API (provided by the nodes shown above), any other system can validate a session without having to make a synchronous call.  This makes session validation even quicker in a distributed system.

Cons of a client side session:

  • Sessions cannot be terminated (there are workarounds to this which I'll get to).
  • Logout action is not fully implemented - The session cookie can be dropped from the browser, but it would still work if resubmitted.
  • Implementation and user details are exposed since everything is stored in a cookie.
  • Cookie size is greater - Sounds trivial but can cause huge headaches (see below).
  • No 3rd party frameworks available (none in Java we could find anyway).

Terminating a client side session
This can be achieved by giving the session an absolute timeout at creation, set to a long period of time (say 1-2 days) which, when exceeded, causes the session to be invalid and forces the user to re-authenticate.
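
As a rough sketch (the names here are made up), the creation time is baked into the signed session state and checked on every validation:

import java.util.concurrent.TimeUnit;

public class AbsoluteTimeout {

   // hard limit after which the session is rejected even though its signature is valid
   private static final long MAX_SESSION_AGE_MILLIS = TimeUnit.DAYS.toMillis(2);

   // issuedAtMillis is read from the signed (and therefore tamper-proof) cookie itself
   public static boolean hasExpired(long issuedAtMillis, long nowMillis) {
      return nowMillis - issuedAtMillis > MAX_SESSION_AGE_MILLIS;
   }
}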

Keeping cookie size to a minimum
This is a trade-off between complexity (in terms of parsing and maintaining sessions), security (strength of signature) and cookie size.  Maintainability is important given that your state will be living "in the wild" in your many clients' browsers.  If you ever want to add, remove or change an attribute in your users' sessions, keep in mind you will need to support both versions of your cookie as you deploy your changes.



The simplest thing to do with your data is to serialise it in a format such as JSON or XML and Base64 encode it (careful of the padding character: the default is a non-cookie-friendly equals sign).  Such a format comes with the benefit of readability and maintainability.

To get the cookie size down to a minimum, a list of values separated by a cookie-safe delimiter (e.g. a dot), effectively encoded and compressed, could be the way to go.
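
For illustration, here is the difference between the two approaches for the same made-up session attributes (no compression applied); the compact form simply relies on the attribute order being fixed and agreed:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class CookiePayloads {

   public static void main(String[] args) {
      // Readable but verbose: serialise as JSON, then Base64 encode the lot
      // (URL-safe alphabet, no '=' padding, so the value is cookie friendly)
      String json = "{\"userId\":\"12345\",\"issuedAt\":1393545600,\"rememberMe\":true}";
      String jsonCookie = Base64.getUrlEncoder().withoutPadding()
            .encodeToString(json.getBytes(StandardCharsets.UTF_8));

      // Compact: positional values joined with a cookie-safe delimiter; the
      // attribute order effectively becomes part of your cookie version
      String compactCookie = "12345" + "." + "1393545600" + "." + "1";

      System.out.println(jsonCookie.length() + " characters vs " + compactCookie.length());
   }
}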

The more secure the signature, the more space it will consume in the cookie.  For example, signing 46 bytes of ASCII characters gives:

  • 87 byte signature using a 512 bit key.
  • 172 byte signature using a 1024 bit key.
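
If you want to see the trade-off for yourself, a throwaway sketch along these lines will print the sizes (the exact encoded length depends on how you encode the raw signature bytes):

import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;
import java.util.Arrays;
import java.util.Base64;

public class SignatureSizes {

   public static void main(String[] args) throws Exception {
      char[] chars = new char[46];
      Arrays.fill(chars, 'x');
      String payload = new String(chars); // 46 ASCII characters, as in the example above

      for (int keySize : new int[] {512, 1024}) {
         KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
         generator.initialize(keySize);
         KeyPair keyPair = generator.generateKeyPair();

         Signature signer = Signature.getInstance("SHA256withRSA");
         signer.initSign(keyPair.getPrivate());
         signer.update(payload.getBytes(StandardCharsets.US_ASCII));
         byte[] signature = signer.sign();

         String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(signature);
         System.out.println(keySize + " bit key -> " + signature.length
               + " raw signature bytes, " + encoded.length() + " characters once encoded");
      }
   }
}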


Server Side Session


This is the traditional model which relies on storing all state in a replicated server side data-store.  The cookie is tiny as it only stores a reference to data stored on your server.
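
In code, the model collapses to something like the sketch below, where SessionStore stands in for the replicated data-store (all the names here are illustrative):

import java.util.UUID;

public class ServerSideSessions {

   // assumed interface over the replicated server side data-store
   public interface SessionStore {
      void put(String sessionId, SessionState state);
      SessionState get(String sessionId); // null when unknown, terminated or expired
      void remove(String sessionId);
   }

   // whatever you need to keep about the user; it never leaves your servers
   public static class SessionState {
   }

   private final SessionStore store;

   public ServerSideSessions(SessionStore store) {
      this.store = store;
   }

   // the cookie only ever carries this opaque reference
   public String create(SessionState state) {
      String sessionId = UUID.randomUUID().toString();
      store.put(sessionId, state);
      return sessionId;
   }

   public SessionState validate(String sessionIdFromCookie) {
      return store.get(sessionIdFromCookie);
   }

   // full control: terminating a session is just a delete
   public void terminate(String sessionId) {
      store.remove(sessionId);
   }
}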

Benefits of a server side session:
  • Full control of session - Can terminate a session on demand instantly.
  • Existing frameworks can reduce the amount of custom code to develop/support - See Apache Shiro.
  • Cookie size is much smaller.
  • Implementation and user details are not exposed since only reference to session data is stored in the cookie.
  • Can store as much session-related data as you like without fear of increasing the cookie size.
  • Implementation of session management can change easily since we store the session state ourselves and it is not "in the wild" (i.e. not stored on many different browsers).

Cons of a server side session:
  • More points of failure - if the DB is unavailable, no sessions can be created, touched or validated.
  • More overhead in creating and touching sessions (a DB read, write and replication is required).  This can be mitigated by applying the DB write asynchronously.
  • No future potential for other applications to verify a session without having to call the nodes.

The points that swung the balance to server side


Session Timeout.  We decided that after 30 minutes of a user's inactivity, we are obliged to log them out if they have elected not to "Remember me".  Relying on the user closing the browser, we decided, is not enough.  This means you need to store a rolling timeout which gets updated on each page view (this is in addition to the hard timeout discussed above regarding session termination).  This requirement makes the client side session a lot more complicated, since each time a page is viewed the session must be updated, re-signed and dropped (Set-Cookie) back to the user.
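
Roughly, in the client side model every request would have to pass through something like the filter below (SessionCodec and ClientSideSession are illustrative stand-ins, not real classes):

import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServletResponse;

public class RollingTimeoutFilter implements Filter {

   // Illustrative collaborators: the codec verifies the signature and both
   // timeouts, and re-signs the updated session state.
   interface SessionCodec {
      ClientSideSession parse(ServletRequest request);
      String sign(ClientSideSession session);
   }

   interface ClientSideSession {
      void touch(long nowMillis); // moves the rolling timeout forward
   }

   private final SessionCodec codec;

   public RollingTimeoutFilter(SessionCodec codec) {
      this.codec = codec;
   }

   @Override
   public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
         throws IOException, ServletException {
      ClientSideSession session = codec.parse(request);
      if (session != null) {
         session.touch(System.currentTimeMillis());
         // every page view re-signs the session and drops a fresh cookie to the user
         Cookie refreshed = new Cookie("session", codec.sign(session));
         ((HttpServletResponse) response).addCookie(refreshed);
      }
      chain.doFilter(request, response);
   }

   @Override
   public void init(FilterConfig filterConfig) {
   }

   @Override
   public void destroy() {
   }
}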

Being able to kill sessions can be useful.
There are rare occasions when we want to kill a user session.  Sometimes this is for a business reason, for example if we suspect that a user has shared their credentials with other people.  It is also reassuring to know we could terminate user sessions immediately in case of a security breach, e.g. hijacked accounts.  This just isn't possible with a client side session.  When you start trying to solve this issue with server side state, it is my opinion that you have the worst of both worlds (i.e. the complexity of client side signing and server side replicated state).

Cookie size
If cookies get too big the impact can be huge; worst case, your users see a blank page with something in your stack returning a 431 status code.  This is due to the request headers exceeding a certain limit.  How do you know when you'll exceed the limit?  How do you know what your limit is?  In a distributed system with multiple entry points, your maximum http header size might vary.  This could cause (seemingly) random pieces of functionality to fail for users with big cookies.


On balance, the server side session is the solution for us.  However, if your organisation has different requirements, the client side session is worth a look.

Sunday, 9 February 2014

Lessons learned from a connection leak in production

We recently encountered a connection leak with one of our services in production.  Here's what happened, what we did and the lessons learned...


Tests and Monitoring detected the issue (silently)

Our automated tests (which run in production for this service) started failing soon after the incident occurred.  We didn't immediately realise, since they simply went red on the dashboard and we carried on about our business.  Our monitoring detected the issue immediately and also went red, but crucially didn't email us due to an environment-specific config issue.

The alarm is raised

A tester on the team realised that the tests had been failing for some time (some time being 4 hours... eek!) and got us to start investigating.

Another dev and I quickly looked into what was going on with our poorly dropwizard app.  We could see from the monitoring dashboard that the application's health check page was in error.  Viewing the page showed us that a downstream service (being called over http) was resulting in the following exception...

com.sun.jersey.api.client.ClientHandlerException:
org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool


My gut reaction was that this was a timeout issue (since it usually is) and connections were not being released back to the connection pool after either slow connections or responses.  We saved the metrics info from the application and performed the classic java kill -3 to get a thread dump for later reading.  We tried checking connectivity from the application's host to the downstream service with telnet which revealed no issue.  We also tried simulating a http request with wget which (after a bit of confusion trying to suppress ssl certificate checking) also revealed no issues.  With no more debug ideas immediately coming to mind, restoring service had to be the priority.  A restart was performed and everything went green again.

Looking in the code revealed no obvious problems.  We had set our timeouts like good boy scouts and just couldn't figure out what was wrong.  It clearly needed more investigation.

We see the same thing happening in our Test environment... Great!

With a lot more time to debug (what seemed like) the exact same issue, we came up with the idea of looking at the current connections on the box...

netstat -a 

This revealed a lot of connections in the CLOSE_WAIT state.

A simple word count....

netstat -a  | grep CLOSE_WAIT | wc -l 

...gave us 1024, which just so happened to be the exact maximum size of our connection pool.  The connection pool was exhausted.  Restarting the application brought this down to zero and restored service, as it had in production.

Now we can monitor our leak

With the command above, we were able to establish how many connections in the pool were in the CLOSE_WAIT state.  The closer it gets to 1024, the closer we are to another outage.  We then had the great idea of piping this command into mailx and spamming the team with the number every few hours.

Until we fix the connection leak, we need to keep an eye on these emails and restart the application before the bucket overflows (i.e. the connection pool is exhausted).

What caused the leak?

We were using Jersey, which in turn was using HttpComponents4 to send a http request to another service.  The service being called was very simple and did not return any response body, just a status code, which was all we were interested in.  If the status code was 200, the user was logged in; anything else we treated as unauthorised.

The code below shows our mistake....

public boolean isSessionActive(HttpServletRequest httpRequest) {
   ClientResponse response = jerseyClient
      .resource(url)
      .type(APPLICATION_JSON_TYPE)
      .header("Cookie", httpRequest.getHeader("Cookie"))
      .get(ClientResponse.class);
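   // BUG: the ClientResponse is never closed, so the connection is never returned to the pool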
  return response.getStatus() == 200;
}


We were not calling close() on the ClientResponse object, which meant the response's input stream was never closed and the connection was never released back into the pool.  This was a silly oversight on our part.  However, there's a twist...


We had run performance tests which in turn executed this method a million times (far more invocations than connections in the pool) without us ever hitting any issues.  The question was, if the code was wrong, why was it not failing all the time? 

For every bug, there's a missing test

The task of creating a test which would fail reliably each time due to this issue was 99% of the effort in understanding what was going on.  Before we identified the above culprit, a tester on our team, amazingly, found the exact scenario which caused the leased connections to increase.  It seemed that each time the service responded with a 403, the number of leased connections would increase by one.

Our component tests mocked the service being called in the above code with a great tool called wiremock, which allows you to fully mock a web service by sending http requests to a server under your control.  When we examined the number of leased connections after the mocked 403 response, frustratingly it did not increase.  There was something different between our mocked 403 response and the actual 403 response.

The difference was that, in a deployed environment, we called the webservice through a http load balancer that was configured to return an html error page if the response was 4xx.  That response body created an input stream on the ClientResponse object which needed to be closed.
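
In wiremock's Java DSL, a stub that reproduces what the load balancer does looks something like this (the path and body here are made up, and a configured wiremock server is assumed):

import static com.github.tomakehurst.wiremock.client.WireMock.aResponse;
import static com.github.tomakehurst.wiremock.client.WireMock.get;
import static com.github.tomakehurst.wiremock.client.WireMock.stubFor;
import static com.github.tomakehurst.wiremock.client.WireMock.urlEqualTo;

public class SessionServiceStub {

   public static void stubForbiddenWithBody() {
      // The crucial part is the body: a bare 403 did not reproduce the leak,
      // a 403 with a body (as returned by our load balancer) did.
      stubFor(get(urlEqualTo("/session"))
            .willReturn(aResponse()
                  .withStatus(403)
                  .withHeader("Content-Type", "text/html")
                  .withBody("<html><body>Forbidden</body></html>")));
   }
}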

As soon as we adjusted our wiremock config to return a response body, we observed the same connection leak as in our deployed environment.  Now we could write a test that would fail due to the bug.  Rather than stop there, we concluded that there was no point checking the leased connection count after that one test alone, as next time the leak could be in a different place.  We added the following methods and annotations to our test base class:

@Before
public void initialiseLeasedConnectionsCount() throws Exception{
   leasedConnectionCounts = getLeasedConnections();
}

@After
public void checkLeasedConnectionsHasNotIncremented() throws Exception{
   int leasedConnectionsAfter = getLeasedConnections();
   if (leasedConnectionCounts != leasedConnectionsAfter){
      fail("Expected to see leasedConnections stay the same but increased from : " + leasedConnectionCounts + " to: " + leasedConnectionsAfter);
   }
}

If we hit another connection leak, we'll know the exact scenario and it shouldn't get anywhere near production this time.  Now we have a failing test, let's make it pass by fixing the connection leak...

public boolean isSessionActive(HttpServletRequest httpRequest) {
   ClientResponse response = null;
   try{
      response = jerseyClient
         .resource(url)
         .type(APPLICATION_JSON_TYPE)
         .header("Cookie", httpRequest.getHeader("Cookie"))
         .get(ClientResponse.class);
      return response.getStatus() == SC_OK;
   }
   finally {
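      // closing the response releases the connection back into the pool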
      if (response != null){
         response.close();
      }
   }
}

Lessons Learned


Make sure your alerts alert

Until you have proven you are actually getting emails to your inbox, all your wonderfully clever monitoring/alerting may be useless.  We lost 4 hours of availability this way!
I think this is so important that I'm tempted to say it's worth causing an incident when everyone is paying attention, just to prove you have all the bases covered.  Netflix do this as a matter of routine, and at random times too.


A nearly exhausted connection pool should fire an alert

If we had received an alert when the connection pool was at around 75% capacity, we would have investigated our connection leak before it caused an outage.  This has also been mentioned by bitly here.  I will endeavour to get this in place ASAP.


Learn your tools

Getting a thread dump by sending your java process the SIGQUIT signal (i.e. kill -3 [java process id]) is fine, but it relies on you having access to the box and the permissions to run it.  In our case (using dropwizard and metrics) all I needed to do was go to the /metrics/threads url on the admin port to get exactly the same data formatted in json.

Other Lessons?

If you can spot any other lessons learned from this, feel free to comment.

EDIT: See here for a more detailed explanation of how to test for connection leaks and this time with all the code!