December 7, 2020
This is the seventh post of twelve for Mixmax Advent 2020.
As we've written about before, we develop and deploy Mixmax in a number of microservices that serve different areas of our product and have fairly separate jobs. At least, that was our original goal back when we first moved to a service-oriented architecture. However, over the intervening four years, our services have grown both in number and in size. Instead of microservices, we've ended up with their larger, more unwieldy friends: macroservices. While continually adding functionality to our existing services often allowed our small team of engineers (we're hiring) to develop new features quickly, it also gave many of our services an intertwined set of dependencies, both in code and infrastructure, and obscured performance bottlenecks.
To untangle our infrastructure and improve observability, we're launching a new engineering initiative to right-size our services. Notably, we aren't deploying a blanket microservice policy, but rather separating our services into more manageable chunks while improving the stability of each one. We've chosen this approach because, as a team, we'd rather move intentionally and ensure that all the work we put into our services improves both our efficacy as an engineering team and our customers' experience of Mixmax. It will also ensure we have the greatest overall impact. At the end of this initiative we'll certainly have more, smaller services, but rather than aim for more services, we'd like to aim for a healthier system.
This raises the question of what makes a healthy system. Oftentimes, service and system health are measured in terms of service-level metrics like memory usage, latency, and CPU usage. However, we've found that those metrics capture only a slice of what makes a service truly healthy. Instead, services should be evaluated against a set of standards that includes performance characteristics but also extends to operational and maintenance concerns that make the service easier to understand and improve. We've decided that a healthy service should:

- Be testable
- Be observable
- Be operational
With these three goals as our guide, we've broken each one down into a set of standards a service can meet or exceed.
For example, the testing goal's unit-test standard is met when 50% of the service is covered by unit tests. Similarly, the goal of streamlining observation has a standard that's met when every related piece of infrastructure is managed through Terraform. All told, we have 17 standards spread across the 3 goals listed above, and a healthy service should meet them all. By ordering these standards by priority and evaluating whether services meet them, we're able to focus our right-sizing efforts on the services that miss the mark the most, as well as on projects that can improve all of our services at once.
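To make the idea concrete, here's a minimal sketch in Python of how standards with priorities could be modeled and a service's gaps ranked. The standard names, goal labels, and priority values are illustrative assumptions, not Mixmax's actual list:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Standard:
    name: str
    goal: str      # which of the three goals this standard supports
    priority: int  # 1 = highest priority

def health_gaps(met: set[str], standards: list[Standard]) -> list[Standard]:
    """Return the standards a service does NOT meet, highest priority first."""
    return sorted(
        (s for s in standards if s.name not in met),
        key=lambda s: s.priority,
    )

# Two example standards drawn from the post; a real list would have 17.
standards = [
    Standard("unit tests cover >= 50% of the service", "testing", 1),
    Standard("all infrastructure managed through Terraform", "observation", 2),
]

# A hypothetical service that has Terraform coverage but lacks tests:
gaps = health_gaps({"all infrastructure managed through Terraform"}, standards)
```

Ranking unmet standards by priority is what lets a team point right-sizing work at the services that miss the mark the most, rather than treating every gap as equally urgent.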
As a recent example, while evaluating our system's health, we found that our ability to search our services' code didn't meet our needs as a team. After discovering that, we invested in new code-search tooling that improved the observability and maintainability of our codebase as a whole.
Not only do these standards help guide our right-sizing efforts now, but they can also be routinely (even automatically) evaluated to create an ongoing history of system health.
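One way such an ongoing history could be captured is by periodically recording a dated snapshot of how many standards each service meets. This is a sketch under assumed names; the service name and counts below are hypothetical:

```python
import datetime

def record_snapshot(history: list[dict], service: str, met: int, total: int) -> None:
    """Append a dated health snapshot so trends can be reviewed over time."""
    history.append({
        "date": datetime.date.today().isoformat(),
        "service": service,
        "score": round(met / total, 2),  # fraction of standards met
    })

history: list[dict] = []
# e.g. a service meeting 12 of the 17 standards today:
record_snapshot(history, "contacts-service", met=12, total=17)
```

Run on a schedule (say, from CI or a cron job), this kind of log turns one-off health evaluations into a trend line the team can watch.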
Because we've defined these standards, we can focus our efforts on projects that materially improve the health of our system, as opposed to a dogmatic commitment to the idea of microservices. In the end, this will lead to smaller, more testable, operational, and observable services. With each individual service improved, we'll have a more understandable and healthy system overall.