May 14, 2019
Over the past few months at Mixmax, we’ve been changing the way we track internal metrics. In this series of blog posts, we’re outlining our journey to metrics bliss. In part 1 we explained why we chose to switch from CloudWatch to Graphite. This post is part 2, where we’ll look at the architecture of our Graphite cluster. Part 3 will dive into how we deploy and maintain that cluster with Terraform.
After deciding that Graphite was our preferred option for storing and retrieving metrics, the next step was designing a production system that not only accomplished our primary goals, but also optimized for our secondary goals as well. From part 1, our primary goals required that the Graphite deployment: 1. Support a high (and increasing) volume of requests without client-side downsampling. 2. Allow aggregating metrics on multiple dimensions. 3. Allow alerting on time series metrics.
Our secondary goals were that the deployment: 1. Require minimal maintenance. 2. Easily scale with our existing tooling. 3. Have a useable node.js client. 4. Store data in a fault-tolerant manner. 5. Support incremental rollout.
Graphite handles primary goals 2 and 3 without any configuration, and secondary goals 3 and 5 aren’t dictated by architecture decisions. So when designing our architecture, we needed to ensure that it would:
With our goals in place, we could tackle the details of designing our architecture. Starting from scratch, there are usually 3 pieces that required to store and visualize Graphite metrics: an aggregator to collect and format metrics from many sources, Graphite for storage, and a visualization layer to retrieve and display Graphite data. We chose to use statsd as an aggregator and Grafana for visualization. Both of those tools are well-used, flexible, and commonly integrated with Graphite. They’re so commonly used, that there are more than a few open source options where they’re deployable together as a package. Sweet! We’re done, right?
Unfortunately, these out-of-the-box solutions, while giving a great sandbox for testing, don’t give the flexibility we needed to effectively scale. To illustrate this, here’s an example of one statsd / Graphite / Grafana instance:
In this example, we send metrics to statsd on one port, which aggregates and stores that data via Graphite, and the Mixmax engineer visualizing the data visits Grafana and loads it from the same instance. While this implementation maximizes simplicity, it lacks scalability. If the single instance storing metrics hits limitations that require scaling, the only option is to increase the size of the instance. Why is that the case? If we take the previous example and add a completely new instance, we end up with 2 separate deployments that have no way of communicating to each other. While that may reduce load on one instance, we’d then have to manually decide where to send individual metrics and view those metrics in separate Grafanas.
To start solving this problem, we needed to separate each component into its own instance, allowing us to scale them independently. With independent instances for each component, if a component were to run into limitations, we could scale it horizontally. However, setting up horizontally scaling (sometimes called “clustered”) Graphite was not as simple as it seemed.
With traditional Graphite, each instance must be configured with the addresses of all other instances and be connected by a common “relay.” Each instance is a standalone Graphite instance, the “relay” handles distributing metrics between them, and each instance communicates with the others to fetch all the relevant data for a query (there are other posts about clustering graphite). Generally, clustered Graphite (with statsd and grafana) with 3 Graphite instances would look something like the following:
In the diagram, arrows represent configured connections between separate instances, and dotted lines show instances that may not exist as we scale up or down.
Immediately, a few problems arise with this architecture:
Together, these issues, along with performance limitations of traditional Graphite itself, meant that traditional Graphite would not be easy for us to scale, failing our first primary goal. This led us to look for other options that simplified the architecture and scaling overhead. The go-graphite project offered a higher-performance and more easily scalable alternative. Rather than requiring a relay and connections between each go-graphite instance, we could deploy an api server configured with active go-graphite instances to combine stored metrics. Along with DNS load balancing to distribute incoming metrics, this would allow us to scale horizontally. The go-graphite architecture looks like this:
With the above architecture, incoming metrics are aggregated in statsd and sent through a single DNS record, which then forwards the request to one of the go-graphite instances. When a Mixmax engineer loads Grafana, the api combines the requested metrics from all the go-graphite instances and returns the combined response. When we need to scale the cluster, we could add a new instance and only have to update the DNS record and the api configuration allowing zero-downtime scaling.
So far, this architecture has scaled to process over 250 million events per day with no signs of stopping! Stay tuned for part 3, where we’ll go over how to deploy a Graphite cluster with terraform.
Like designing and deploying scalable systems? Join us!