Reducing latency, load and costs all by switching a small part of CANDDi to use Node.js
TL;DR Switching a subsection of our architecture to Node.js resulted in a 25x speed increase, a saving of £160 per month and a simpler, automated roll-out of our infrastructure.
How CANDDi tracks visitors
Since the early days of CANDDi we’ve had four main components in our application to enable us to collect data, transform it into something useful and display it back to our users in their dashboard.
- Client side tracking code
- Data collection servers
- Data processing servers
- CANDDi Dashboard (the bit our clients actually use)
The client side tracking code is what you as a CANDDi customer embed on your website; it’s what enables CANDDi to track a visitor’s actions on your site. This data is then sent back to our data collection servers, where it is transformed and queued up to be processed. Our processing servers then pull data from this queue and try to make sense of it. Finally, all this data ends up in the dashboard, where you can view your streams and visitors. For this post we’ll be focusing on the second stage in this process: the collection and queueing of messages.
CANDDi’s tracking code reports back to a series of “endpoints”. Each of these endpoints is designed to handle a specific type of data, for example a form post or a clicked link. Historically these endpoints were written in PHP (along with most of CANDDi) and served through a generic Apache web server. For years they happily did their job: accepting incoming messages, transforming them and queueing them up in RabbitMQ ready for processing. However, as the CANDDi client base grew and we tracked the actions of more and more website visitors, the response time of these servers crept up until we realised we had to find a way to scale this service.
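As a rough illustration of the endpoint idea, the sketch below maps each endpoint to a function that turns its raw payload into a message ready for queueing. This is not CANDDi’s actual code: the endpoint paths, field names and message shape are all hypothetical.

```javascript
// Hypothetical sketch: one transform per endpoint type.
// Paths, field names and the message shape are assumptions for illustration.
const endpoints = new Map([
  ['/form', (payload) => ({ type: 'form_post', fields: payload.fields || {} })],
  ['/link', (payload) => ({ type: 'link_click', href: String(payload.href || '') })],
]);

// Look up the endpoint for a path and build the message to be queued.
// Returns null when no endpoint handles the path.
function toQueueMessage(path, payload, receivedAt = Date.now()) {
  const transform = endpoints.get(path);
  if (!transform) return null;
  return { ...transform(payload), receivedAt };
}

// Usage: a clicked link becomes a small, uniform message.
console.log(toQueueMessage('/link', { href: 'https://example.com/pricing' }, 1000));
```

Keeping each transform small and independent means a new type of tracked action only needs one more entry in the table.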
A short diversion about scaling
Put simply, there are three approaches to scaling a piece of technology.
- Horizontal - put more servers doing the same thing in parallel
- Vertical - use the same number of servers but buy bigger or faster kit
- Optimise - rework your codebase and components so that you get more bang for your buck with the kit you have.
To improve the speed of our data collection server we headed down the route of option 3, optimisation. This leaves us free to explore vertical and horizontal scaling if needed in the future. In fact, the new stack will horizontally scale very well when we need it to.
Initially we investigated quick-fix solutions such as swapping Apache for Nginx and PHP-FPM. While this was quicker, it still didn’t give us enough of a performance gain. We then started looking at other potential solutions and set up a quick experiment using Node.js.
So why did we migrate a fair amount of code and logic to Node.js rather than another language? In truth it was the first of our investigations, and the performance gain was so dramatic it didn’t seem worth looking much further. We already used Node.js in other areas of CANDDi, both to send real-time updates to our dashboard and to deliver most of our emails. This gave us some background in how to write Node applications and connect them to other areas of CANDDi.
Starting with a simple Express.js application to handle HTTP requests, we bolted on a RabbitMQ client to enable us to queue up requests for processing. Between the HTTP layer and the RabbitMQ client sits some simple logic to parse and validate the requests to be queued. We then pass an HTTP response back to the client, and that’s basically it.
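The flow described above (parse, validate, queue, respond) can be sketched roughly as follows. This is a minimal illustration rather than our production code: the validation rule, the status codes and the `publish` callback, which stands in for whatever call hands a message to RabbitMQ, are all assumptions.

```javascript
// Hypothetical sketch of the parse -> validate -> queue -> respond flow.
// `publish` stands in for whatever call hands a message to RabbitMQ.

// Parse a raw request body; returns null if it is not a JSON object.
function parseBody(rawBody) {
  try {
    const parsed = JSON.parse(rawBody);
    const isObject = parsed !== null && typeof parsed === 'object' && !Array.isArray(parsed);
    return isObject ? parsed : null;
  } catch (err) {
    return null;
  }
}

// Assumed validation rule: every message needs a non-empty string `type`.
function isValid(message) {
  return typeof message.type === 'string' && message.type.length > 0;
}

// Handle one tracking request and describe the HTTP response to send back.
function handleTrackingRequest(rawBody, publish) {
  const message = parseBody(rawBody);
  if (message === null || !isValid(message)) {
    return { status: 400, body: 'invalid tracking message' };
  }
  publish(JSON.stringify(message));
  return { status: 202, body: 'queued' };
}

// Usage with an in-memory array standing in for the queue.
const queue = [];
const ok = handleTrackingRequest('{"type":"link_click"}', (m) => queue.push(m));
const bad = handleTrackingRequest('not json', (m) => queue.push(m));
console.log(ok.status, bad.status, queue.length); // prints: 202 400 1
```

In a real application a function like this would be called from a route handler, with `publish` wrapping the RabbitMQ client. Keeping the logic free of HTTP and queue details also makes it easy to test in isolation.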
We then run two Amazon EC2 micro instances behind an Elastic Load Balancer, which not only balances the load across our servers but also handles all of our SSL encryption/decryption, simplifying the Node deployments and saving a few CPU cycles. Each micro instance runs an Nginx instance to proxy between two Node processes, giving us four Node processes across two EC2 micro instances. Each of these Node processes is managed by Supervisord to ensure the process is restarted if it dies for any reason, or that our operations team is alerted if it cannot be restarted.
In truth we could probably get away with running one process on one box: the CPUs on the instances are peaking at around 8% usage. However, this would be dangerous, because if that box died or received a sudden influx of data we would have no resilience.
Results of migration
So how does that compare to our old setup? We used to run two EC2 large instances, which are more expensive to run. So now not only are our new tracker boxes cheaper to run, they’re also more fault tolerant and they are MUCH faster!
Previously we were averaging response times around 0.05s per request, now we’re averaging around 0.002s per request. That’s around 25 times faster for us to track traffic!
The cost per month of our micro EC2 instances is around £20 per box. In contrast, our old servers (EC2 large instances) cost around £100 per box per month, which means we’ve saved (2 x £100) - (2 x £20) = £160 per month!
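For anyone who wants to check the sums, here is the arithmetic behind the speed and cost figures as a small script. The inputs are the approximate averages and monthly prices quoted above; the variable names are just for illustration.

```javascript
// Figures quoted in the post (approximate averages and monthly prices).
const oldResponseSeconds = 0.05;
const newResponseSeconds = 0.002;
const oldCostPerBox = 100; // GBP per month, EC2 large instance
const newCostPerBox = 20;  // GBP per month, EC2 micro instance
const boxes = 2;

// Ratio of old to new average response time.
const speedup = oldResponseSeconds / newResponseSeconds;

// Difference between the old and new monthly bills.
const monthlySaving = boxes * oldCostPerBox - boxes * newCostPerBox;

console.log(`${Math.round(speedup)}x faster, £${monthlySaving} saved per month`);
// prints: 25x faster, £160 saved per month
```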
That’s 2 micro instances, each running 2 Node processes, 1 Nginx and 1 Supervisord process, handling as much traffic as our old setup (if not more), faster and at a fraction of the cost. And because we’ve scripted the process of building these boxes (a future blog post will cover how this works), if we break one or need another to cope with a peak in traffic we can build a fresh box, add it to the load balancer and have it processing requests within a few minutes.