April 19, 2019
Over the past few months at Mixmax, we’ve been changing the way we track internal metrics. In this series of blog posts, we’re outlining our journey to metrics bliss. In this post, part 1, we’ll explain why we switched away from CloudWatch. In part 2 we describe the architecture of our Graphite cluster. Finally, part 3 dives into how we deploy and maintain that cluster with Terraform.
As we’ve written before, up until this point we used AWS CloudWatch as our primary method for tracking and alerting on metrics across all our production systems. However, in the last six months we began running into limitations of CloudWatch, which led us to explore other solutions that could scale more gracefully with our growing needs. The two primary limitations we ran into were:
Combined, these limitations raised the barrier to publishing detailed metrics for new features, and forced engineers on our team to think twice before adding instrumentation to them. Practically, it meant that we had fewer metrics on important systems than we’d like. Feeling that our tooling choices should empower engineers rather than limit them, we decided to explore options other than CloudWatch for many of our internal metrics.
To direct our exploration, we outlined a set of requirements for our new metrics tool: the things we value most in metrics tooling. First, we knew the new solution must ingest high-cardinality metrics, scale with our growing volume, and support alerting on the results. We also considered a few other attributes that were important, but not absolutely required: we preferred solutions that were flexible, stored data durably, scaled horizontally, were well supported in the Node.js ecosystem, and didn’t require constant maintenance.
It’s worth keeping in mind that our solution could tolerate some risk of dropping metrics, and didn’t need to store or query full, plaintext payloads. We also ruled out most managed solutions, since they offered more functionality than we needed from this metrics tool and would have required higher-impact changes to our existing tooling. As a result, we considered three self-hosted solutions that satisfy our three requirements:
Prometheus is a time-series metric collection and aggregation server. It offered some interesting options for monitoring remote processes, but would have required restructuring our backend services to adhere to its polling architecture. We also ruled out Prometheus because it lacks a durable storage mechanism of its own and would have required Graphite or InfluxDB to back its metric storage.
InfluxDB is a time-series database that’s part of a larger stack for ingesting, storing, and alerting on time-series data. We ruled it out because the open-source version doesn’t offer scaling or high availability, and it isn’t well supported in the Node.js ecosystem.
Graphite is a time-series database system for ingesting and storing metrics. We finally decided on Graphite because it best satisfies our five preferred attributes: it’s flexible, stores data durably, scales horizontally, and is well supported in the Node.js ecosystem. Though it required a bit more initial setup, Graphite has well-tested, scalable, open-source implementations that don’t require constant maintenance. Additionally, since Graphite itself is just an API for storing and retrieving data, we gained the flexibility to swap out implementations or data stores as necessary.
Continue with part two, where we go over our clustered Graphite architecture, which handles hundreds of millions of data points per day.
Interested in working on a data-driven team? Join us!