December 8, 2020
Thanks for tuning in to Mixmax Advent 2020! This is the 8th post in the series; click the link to check out the prior posts and check back for more :)
In this post, I’d like to detail our journey migrating from Elastic Beanstalk (EB) to Fargate, using Terraform to codify our infrastructure along the way. The intended audience for this blog post is interested engineers with a passing familiarity with Amazon Web Services. We gained a lot of flexibility and insight into our systems, were able to recover reasonably quickly from a recent AWS outage, and saved some cash along the way. I hope you enjoy the ride; I know I have!
Enter Elastic Beanstalk. Amazon’s PaaS offering was well suited for the Mixmax of the time, allowing us to quickly deploy code without hassle while supporting deeper customization of environments and allowing us to “own” as much of our infrastructure as we wanted. We could SSH in if we wanted, test tweaks on EC2 instances directly, and pre-provision capacity. We designed custom CI pipelines for Mixmax-specific business logic, and Elastic Beanstalk was yielding enough that what few quirks it did have were overlooked or worked around for years.
Eventually though, it became clear to our engineers that we had pushed the limits of Elastic Beanstalk and would need to imagine another solution. Deploying new code or rolling back took a very long time at our larger scale (sometimes an hour or more), and both processes were flaky with such large instance pools. Elastic Beanstalk did not scale responsively enough to sudden shifts in traffic, and only supported scaling off one metric at a time; eventually, we ended up pinning minimum instance counts to the same value as the maximum - we weren’t scaling at all and were instead spending lots of money to “paper over” capacity issues. These problems alone caused multiple outages and many more headaches.
That was not the only source of issues though. Our infrastructure-as-code tool of choice, Terraform, had poor support for EB when we first migrated, and thus none of these services were implemented in code - they were created by hand. Given how easy it was to accidentally create application drift, consistency was introduced between apps by using one very permissive IAM policy & role, and one very permissive security group attached to everything - a decent approach to consistency, but not decent for security. Finally, service responsiveness & health were largely monitored via load balancers to permit automatic termination of unhealthy instances; this makes sense for web and API services, but was pretty silly for our bee-queue queue workers.
I joined Mixmax in November of 2019 as Mixmax’s first engineer dedicated to infrastructure development & devops, and quickly realized I had my work cut out for me. Not everything was bad; in fact, most things were great. Mixmax had successfully containerized all of its applications, and the constant assessment of how we can improve fostered a mature engineering culture of tooling around development, leaning heavily on CI to eliminate toil and error-prone engineering procedures. One thing was clear to everybody though: Elastic Beanstalk had to go.
Okay, I have to admit, I was more keen on getting things into Terraform than Fargate. Fargate is a fantastic platform - it’s very flexible, a first party service within Amazon with good integrations, and it’s easy to set up. There was also great potential to scale proactively on incoming load too, which promised to save us lots of money and reduce bottlenecks and pain points.
However, as a new Mixmax engineer, I had no idea what was actually running - what was talking to what, what permissions they should have, and what a correct configuration looked like. It was more important in my mind to bring this infrastructure into code, so that every change to it would get the benefit of git blame and be contextualized with business requirements in Jira tickets. Fargate was a means to this end; nobody liked Elastic Beanstalk (I certainly didn’t), and it was simple enough to hit the ground running.
But first, we needed appropriate abstractions in Terraform. We’d need at least two Terraform modules, one for workers and one for web services. These took a little while to create; it meant looking at what the common Mixmax use case in Elastic Beanstalk was, and translating that to a different set of primitives in Fargate (and Application Load Balancers, and security groups, and IAM roles, etc.). The first revisions of these modules were designed so that an engineer did not need to know many nitty gritty details about networking, security groups, or IAM permissions; they would “just work”.
This turned out to be a nasty anti-pattern; covering every edge case for Mixmax usage in Terraform created really painful and obtuse code, and those edge cases seldom worked well. Eventually, we landed on modules that dependency-injected the important details - subnets, inbound ports, SSL cert ARNs, etc. This was a learning curve for some engineers who hadn’t had to deal with these details before, but it birthed our Fargate cookbook, a first class resource on how to accomplish common tasks with this new infrastructure. It also made testing much easier and cleaner - we had assurances these modules were going to do what they set out to do.
Open sourcing these modules is a current focus of mine; keep an eye on this blog for future announcements ;)
Next, we needed to deploy our software onto these new sets of infrastructure. I had the benefit of our applications already being containerized in Docker and published to our Elastic Container Registry; it was simply a matter of shipping ECS Task Definitions and updating the ECS Fargate Services to use them. Our CI pipeline had to tolerate deploying to both Elastic Beanstalk and Fargate services to facilitate easy testing, cut over, and cut back. Since task definitions contained environment specific details like environment variables, secrets, and file descriptor limits, we opted to consider them deployment artifacts. A task definition for each environment would live within application repositories and would be deployed whenever a release happens, with some mild templating to point to the correct latest container image.
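To make the "mild templating" idea concrete, here is a minimal sketch in Python of how a per-environment task definition could be rendered at release time. It is not our actual pipeline code: the template contents, the `__IMAGE__` placeholder, and the `render_task_definition` helper are all hypothetical, and only the field names (`family`, `containerDefinitions`, `ulimits`, and so on) come from the ECS task definition format.

```python
import copy

# Hypothetical per-environment template that would live in an application
# repository. The family, container name, and "__IMAGE__" placeholder are
# illustrative assumptions, not Mixmax's real values.
TEMPLATE = {
    "family": "example-api-staging",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "256",
    "memory": "512",
    "containerDefinitions": [
        {
            "name": "app",
            "image": "__IMAGE__",
            "environment": [{"name": "NODE_ENV", "value": "staging"}],
            "ulimits": [
                {"name": "nofile", "softLimit": 65536, "hardLimit": 65536}
            ],
        }
    ],
}


def render_task_definition(template, repository_uri, image_tag):
    """Return a copy of the template with the image placeholder filled in."""
    rendered = copy.deepcopy(template)  # never mutate the checked-in template
    image = f"{repository_uri}:{image_tag}"
    for container in rendered["containerDefinitions"]:
        if container["image"] == "__IMAGE__":
            container["image"] = image
    return rendered
```

The rendered dictionary is what a release step would register as a new task definition revision before pointing the service at it. Keeping the template immutable means the same file can be rendered again for a rollback to an older image tag.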
I’ll jump ahead a bit. When everything was ported to Fargate and we no longer needed to support Elastic Beanstalk, our CI pipeline was dramatically simplified; many helper shell scripts were made obsolete, and hacks to support EB were removed. This reduced complexity in our pipeline has reduced the number of deployment oddities substantially. Additionally, deploying to Fargate alone is much faster than deploying to Elastic Beanstalk; getting that important bug fix in front of users, or reverting it, now takes on the order of 5-10 minutes instead of 10-60 minutes.
Another substantial problem with adopting Fargate over Elastic Beanstalk was being able to observe our software while it was running to troubleshoot bugs and performance issues. In the Before Times, engineers were comfortable logging into Elastic Beanstalk instances via SSH and directly inspecting running software with things like strace and flamegraphs. They might view application logs directly on the instance, or fetch them from the Elastic Beanstalk console. There also existed a Sentry instance that collected errors from instances, and Grafana for viewing custom statistics in a sophisticated fashion.
Sentry continues to serve us well, as does Grafana. But flamegraphs, strace, application/web logs and debugging performance would need to look much different in this world. Application and API query logs would need to get shipped somewhere. More saliently, one cannot log onto Fargate instances; there is no SSH daemon running on our Alpine Docker images. We would need ways to infer what a running task is actually doing. We set up basic Cloudwatch monitors and alerting in our Terraform modules, but we’d need to get sophisticated to match the level of introspection we had on Elastic Beanstalk.
I had used NewRelic in a previous life and found it immensely useful. However, NewRelic is somewhat expensive, and being a new engineer in a small (but growing) SaaS company meant I was not in a position to sell my new employers on this expensive product. I’d already made a large case to move to Fargate, and it was important to prove this out first before other large investments.
Fortunately, Elastic, the folks behind Elasticsearch, have created a lovely product called Elastic APM, accomplishing what NewRelic does but also supporting a free license for self-provisioned hardware. I lost a week trying to make this function on AWS’s hosted Elasticsearch, but as it turns out Elastic has not licensed APM as free as in freedom (or open source), but free as in gratis - we had to run it ourselves. Still, this observability was very important for us if we wanted to succeed in porting everything to Fargate and understand what it was all doing.
Almost immediately Elastic APM began paying dividends. We found literal `sleep`s in our application code, one line fixes that doubled performance on heavy endpoints, and redesigned many endpoints and a lot of logic to behave in a smarter fashion. We also added custom transactions to our bee-queue wrapper to make our job queues more performant. While the Elasticsearch instances necessary to host APM cost a decent amount of money, they saved us more than that in performance improvements alone - and additionally helped us fix bugs and streamline the flow of data through our microservices. This continues to be a resounding success.
With Elastic Beanstalk, we were logging all API requests via an intermediate Nginx server, whose logs were eventually shipped to Elasticsearch. However, in Fargate, we had no intermediate Nginx server. We set up our Terraform module to configure ALB logs and shipped those to S3. This worked for a while, but was a bit clunky; we could query them with Athena, but Athena was somewhat expensive to search against our busiest services and wasn’t terribly intuitive. We eventually set up these S3 buckets to be available to our data warehouse; now they’re available for querying with Redash. Shoutout to the Mixmax Data Team!
Additionally, we were fetching application logs in raw text format from Elastic Beanstalk instances. This works, but isn’t terribly sophisticated - if you wanted to fetch every instance’s logs and compare them, you would have to fetch raw text files and run your own analysis on localhost. In contrast, Fargate seamlessly ships `stdout` to Cloudwatch Logs with very little configuration. Cloudwatch Logs also has nice integrations with Elasticsearch. With this setup, we could query to our hearts’ content.
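For a sense of how little configuration is involved, the sketch below builds the `logConfiguration` block that a container definition needs in order to send `stdout` to Cloudwatch Logs through the `awslogs` log driver. The log group name, stream prefix, and helper function are hypothetical; the three `awslogs-*` option names are the ones the driver accepts.

```python
def awslogs_configuration(log_group, region, stream_prefix):
    """Build the logConfiguration block for an ECS container definition."""
    return {
        "logDriver": "awslogs",
        "options": {
            "awslogs-group": log_group,
            "awslogs-region": region,
            "awslogs-stream-prefix": stream_prefix,
        },
    }


# Hypothetical container definition; the name, image, and log group are
# placeholders for illustration only.
container_definition = {
    "name": "app",
    "image": "example-api:latest",
    "logConfiguration": awslogs_configuration(
        "/ecs/example-api", "us-east-1", "app"
    ),
}
```

With this in place, anything the process writes to `stdout` or `stderr` lands in a log stream per task, so no log files need to be fetched from individual hosts.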
Lastly, one capability we had on Elastic Beanstalk was the ability to generate flamegraphs. Where Elastic APM shows us what one particular long-running API call is doing, flamegraphs show us what a single node.js process is doing - garbage collection, handling IO, etc. In Elastic Beanstalk, we would collect performance data with perf, copy it to localhost, and generate the flamegraph.
Since we could not use perf easily on Fargate, I opted to create a single EC2 ECS instance that runs Netflix’s nifty Flamescope. Engineers can deploy a task on this cluster-of-one, take their performance measurements, and display them in a web interface so others can pry into the details with them. This is not a capability we use often, so this setup happened pretty late in our migration to Fargate. However, it has helped us illuminate one of the last dark corners in our infrastructure.
Alright, so we now had all the tools necessary to dig into what a service was doing. Now we actually needed to port things. To do so, we had to get a good idea of what was deployed already - what it was doing, how it behaved, what were expected errors and what errors might be new.
When examining existing services, it became clear that there was a lot of wisdom baked into their configurations. The scaling policies, minimum instance counts, and instance counts were not random; they were the result of methodical tweaking to produce a performant Mixmax. There were also many antipatterns as a result of Elastic Beanstalk; minimum instance counts were often much higher than they needed to be because we couldn’t scale responsively, costing us significant amounts of money.
First things first, we had to instrument each service with Elastic APM before even spinning up a Fargate environment. This helped us create a performance baseline to measure each setup against the other. Using this helped determine early in the porting process what an individual service’s performance bottlenecks were; sometimes it was CPU, but in our microservice architecture it was more often IO and external requests to other services. We took note of these bottlenecks and how to measure them so we could create custom autoscaling policies for each. We also took note of what dependencies existed to AWS or other services, and created least privilege security groups and IAM policies that gave the necessary permissions.
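As an illustration of what "least privilege" means in practice, here is a small Python sketch that builds an IAM policy document from an explicit list of actions and resources and refuses wildcards. The helper and the example stream ARN are hypothetical; our real policies are generated by Terraform, and this only shows the shape of the result.

```python
def least_privilege_policy(grants):
    """Build an IAM policy document from (actions, resources) pairs.

    Wildcards are rejected so every permission has to be spelled out.
    """
    statements = []
    for actions, resources in grants:
        for value in list(actions) + list(resources):
            if value == "*" or value.endswith(":*"):
                raise ValueError(f"wildcard not allowed: {value}")
        statements.append(
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": sorted(resources),
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# Hypothetical example: a service that only needs to write to one stream.
policy = least_privilege_policy(
    [
        (
            ["kinesis:PutRecords", "kinesis:PutRecord"],
            ["arn:aws:kinesis:us-east-1:123456789012:stream/example-events"],
        )
    ]
)
```

Compared with one permissive policy shared by every app, a document like this doubles as a record of what a service actually depends on.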
Then we had to actually launch instances of our services in Fargate and test them out. Up until this point, I had been off on my own developing abstractions for Mixmax engineers to use; however, once we were at this point I was no longer the expert. Our software developers knew what the software was doing better than I did; they wrote it, after all. This part of the process was intensely collaborative, and everybody learned from each other along the way. Big shoutout to every engineer at Mixmax; literally everyone ported a service and chipped in to this process.
However, the process itself was somewhat vanilla. We started with deploying to our staging infrastructure. We ported API and web traffic via weighted Route53 records, and ported “traffic” to our workers by capping the number of instances workers could spawn at once. For example, if there were 5 Elastic Beanstalk workers, we would launch 1 Fargate worker to consume 1/6th of the total worker queue. However, it became clear early in the migration that small amounts of staging traffic would never produce enough feedback for us to feel comfortable porting to production. We began moving 50% or 100% of traffic in staging to Fargate immediately to get good feedback, while keeping the rest of the engineering team informed so they could watch for problems. Every step along the way, Sentry, Grafana and Elastic APM were monitored to ensure the new infrastructure was not producing unexpected errors.
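The worker arithmetic above can be sketched in a few lines of Python. This assumes every worker pulls jobs from the shared queue at roughly the same rate, which is a simplification; the function name is made up for illustration.

```python
from fractions import Fraction


def fargate_queue_share(eb_workers, fargate_workers):
    """Fraction of the queue that Fargate workers are expected to consume,
    assuming all workers pull jobs at the same rate."""
    total = eb_workers + fargate_workers
    if total == 0:
        raise ValueError("at least one worker is required")
    return Fraction(fargate_workers, total)


# 5 Elastic Beanstalk workers plus 1 Fargate worker.
share = fargate_queue_share(5, 1)
```

Here `share` is `Fraction(1, 6)`, matching the 1/6th in the example; adding Fargate workers or removing Elastic Beanstalk ones shifts the ratio in the obvious way.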
Mixmax is an expansive set of software; the feedback we got from staging was okay, but not sufficient to exercise every path a bit of data might take through our system. Mixmax is comfortable with experimentation; as engineers, we have reasonable leeway to try out cool new things so long as we can also reasonably back out of any decisions we make. We gradually biased towards getting early feedback from production too - we would port 0.5% of web traffic to an API, or 1 worker out of 40, and watch Sentry and Elastic APM closely. Whenever there was an issue, major or minor, we immediately cut all traffic back to the existing infrastructure and investigated. For some services, this created a drawn-out migration - but what was most important was the protection of our users’ data and experience, and we seldom disappointed them in the middle of a migration.
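For the web side, a traffic split like 0.5% has to be expressed as a pair of Route53 record weights, which are integers between 0 and 255; each record receives its weight divided by the sum of the weights. The Python sketch below shows one way to derive such a pair. The helper is hypothetical and simply rejects shares that cannot be expressed exactly within that range.

```python
from fractions import Fraction

MAX_WEIGHT = 255  # Route53 weighted records accept integer weights 0-255


def route53_weights(new_share):
    """Return (old_weight, new_weight) so that the new record receives
    exactly `new_share` of the traffic."""
    share = Fraction(new_share)
    if not 0 <= share <= 1:
        raise ValueError("share must be between 0 and 1")
    new_weight = share.numerator
    old_weight = share.denominator - share.numerator
    if max(old_weight, new_weight) > MAX_WEIGHT:
        raise ValueError("share cannot be expressed with weights up to 255")
    return old_weight, new_weight


canary = route53_weights(Fraction("0.005"))  # 0.5% to the new infrastructure
cut_back = route53_weights(0)                # all traffic back to the old one
```

In this sketch `canary` is `(199, 1)` and `cut_back` is `(1, 0)`, so cutting back is a single record update away.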
After many trials and tribulations, the entire process was done; we were complete. The empty Elastic Beanstalk console was cause for celebration.
Recently, Amazon Web Services suffered a major outage in its us-east-1 Virginia region, where Mixmax is currently deployed. This affected many services; even some doorbells and vacuums fell victim to this outage. This greatly affected us as well, yet we were able to recover hours before Amazon itself was stable thanks to the efforts to port our infrastructure to Fargate and Terraform. This was an interesting validation of our efforts here, and I’d like to take a moment to discuss them.
At 8:15 AM Eastern Standard Time on November 25th, 2020, an on-call engineer was woken up with an alert that our analytics events were delayed. Our analytics subsystem is based on Kinesis, an AWS service that presents sharded streams of data for consumption by AWS Lambda. In our system, these streams were processed by Lambda and stored in our database for consumption by our web frontend.
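For readers who have not seen this pattern, here is a minimal Python sketch of a Lambda function consuming a batch of Kinesis records. Lambda delivers each record's payload base64-encoded under `record["kinesis"]["data"]`. The assumption that payloads are JSON, and everything about what happens to the decoded events, is hypothetical and not a description of our actual analytics code.

```python
import base64
import json


def handler(event, context):
    """Decode a batch of Kinesis records delivered to Lambda.

    Assumes each payload is a JSON document (an illustrative assumption).
    """
    decoded = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        decoded.append(json.loads(payload))
    # A real consumer would write `decoded` to a database here.
    return {"processed": len(decoded), "events": decoded}
```

Because Lambda polls each shard on the function's behalf, a consumer like this depends entirely on Kinesis being healthy.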
Due to the integral nature of Kinesis in processing our analytics data, it soon became clear that Kinesis itself was having issues; multiple services that pushed data into Kinesis or read off it were producing tons of unanticipated errors. This caused some cascading failures in downstream systems, notably AWS Cloudwatch. AWS Cloudwatch manages scaling for our Fargate infrastructure, and without scaling our services stuttered and then halted under the pressure of the day’s normal traffic - lots of Mixmax customers eager to complete their work before the Thanksgiving holiday in the United States.
After some time, it became clear that the issue was present in us-east-1, and not present in other regions. We quickly began efforts to port our infrastructure to us-west-2, the Oregon region in AWS, spinning up replicas of our infrastructure there with copies of the Terraform we originally used to deploy into us-east-1. Using this code, we spun up a network that would have taken an hour or more to build by hand within 10 minutes, and the Fargate service and Lambdas that consume from Kinesis within the hour. We additionally began adding EC2 capacity to our ECS cluster in us-east-1, and managed some of the scaling ourselves for services that were impacted by the Cloudwatch failures. At 1:13 PM Eastern Standard Time, 3 hours before AWS recovered, we brought our analytics subsystem online once again, and restored a working Mixmax application to our customers.
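Managing scaling by hand when Cloudwatch-driven autoscaling is unavailable can be as simple as setting a service's desired count directly. The Python sketch below is an assumption about what such a stopgap could look like, not a record of the commands we ran that day: the sizing function and its inputs are hypothetical, and `ecs_client` is expected to be a boto3 ECS client, whose `update_service` call accepts `cluster`, `service`, and `desiredCount`.

```python
import math


def desired_task_count(requests_per_second, per_task_capacity, minimum, maximum):
    """Estimate how many tasks are needed, clamped to [minimum, maximum].

    `per_task_capacity` is the requests per second one task can handle,
    a number that would have to come from your own measurements.
    """
    if per_task_capacity <= 0:
        raise ValueError("per_task_capacity must be positive")
    needed = math.ceil(requests_per_second / per_task_capacity)
    return max(minimum, min(maximum, needed))


def scale_service(ecs_client, cluster, service, count):
    """Set the desired count directly, bypassing autoscaling."""
    ecs_client.update_service(cluster=cluster, service=service, desiredCount=count)
    return count


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   count = desired_task_count(900, 50, minimum=4, maximum=40)
#   scale_service(boto3.client("ecs"), "example-cluster", "example-api", count)
```

Passing the client in as an argument keeps the sizing logic testable without touching AWS.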
At Mixmax, we believe an outage is a terrible thing to waste. This one was particularly painful, and if we do not learn lessons from it we are doomed to repeat the experience. This has spurred us to invest heavily in a cold/warm second region in Oregon with the cost and efficiency improvements we gained with Fargate. About half this infrastructure was configured on the day of the outage, though we need to formalize more of it and make our failover more seamless and resilient. We dub this current effort the Oregon Trail after a video game many of us played in our youth, though we expect our current efforts to be significantly less painful than the game.
Wow, you’ve read all of it. Maybe. Sorry, this was longer than I intended. I hope it was enjoyable.
Mixmax now enjoys a repeatable and dependable infrastructure, one that is malleable and observable. This process of redeploying everything was long, but we feel that it has levelled Mixmax up, and we exercised this new flexibility to recover relatively quickly during a recent AWS outage. Using this new malleable infrastructure, we are investing in high availability improvements and tooling to run safe experiments with our infrastructure (look out for a future blog post!)
I am personally quite proud of the Terraform modules that have generated this infrastructure for us. We are working on open sourcing these modules, so watch this blog for future updates! If working with our fancy new infrastructure and open source software sounds interesting to you, we are always interested in hearing from engineers (and others!) Check out our careers page! And thanks for reading :)