Determining why that server is on fire

December 9, 2016

This blog post is part of the Mixmax 2016 Advent Calendar. The previous post on December 8th was about the Gmail PubSub API.

tl;dr - We run our node apps so that we can profile their CPU usage on the fly.

The problem

Have you ever watched the CPU usage of your Node.js process spike on your servers? Did you feel helpless to figure out why?

Servers on fire

One way to figure out what your process is doing is to use a flamegraph: a visualization of the kinds of work it's doing (think stack traces) and how long each component of that work takes (in terms of CPU usage).

There is already a bunch of amazing material on how to build flamegraphs. However, each example only covers profiling a process locally (once the user has SSH'd into a server); there isn't a lot of tooling for profiling servers remotely. Today we're going to give you a pair of scripts that let you do just that, provided you're running on AWS. These scripts also handle some advanced deployment scenarios and work around some bugs reported in other flamegraph material.

Too many processes

The first issue we faced when building these scripts was having too many processes to profile: we use Node clustering to get the most out of our instances, but there are very few examples of profiling multiple Node processes at the same time in order to build flamegraphs.

Endless perf map growth

To compound the issue, the JIT symbol map required to build flamegraphs grows without bound.

The solution

As it turns out, you can pass multiple process IDs to the -p flag of perf record, as long as they are comma-separated. To mitigate the symbol map growth issue, we simply run all of our node processes with the --perf_basic_prof_only_functions flag. This does mean we get some unknown entries in our flamegraphs, but it also ensures that we won't run out of memory on our servers any time soon #tradeoffs.
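To make that concrete, here's a minimal sketch of the PID-joining trick. Since actual PIDs vary per box, we simulate pgrep's one-PID-per-line output with printf; the perf invocation at the end is commented out because it requires root and a running target:

```shell
#!/bin/bash
# Simulate `pgrep node` output (one PID per line); on a live box you'd run:
#   PIDS=$(pgrep node | xargs | sed -e "s/ /,/g")
PIDS=$(printf '123\n456\n789\n' | xargs | sed -e "s/ /,/g")
echo "$PIDS"   # -> 123,456,789

# perf record accepts the comma-separated list directly:
# sudo perf record -F 99 -p "$PIDS" -g -- sleep 30
```

xargs collapses the newline-separated PIDs onto one space-separated line, and sed turns the spaces into the commas perf expects.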

Profiling Node apps

To simplify creating these flamegraphs, we wrote a few bash scripts that build the flamegraphs remotely on the server of interest and then copy them back to the engineer's machine.

profile.sh

The first script builds the second by embedding how long we'd like to profile for and whether we'd like to profile all node processes on the box or only a single one. That script looks like:

#!/bin/bash
#
# See: https://mixmax.com/blog/determining-why-that-server-is-on-fire for details.
#
# Example usage: ./profile.sh $INSTANCE_ID [$DURATION] [$PID]
# Where:
#   - $INSTANCE_ID is the EC2 identifier of the instance to profile the node
#     processes on.
#   - $DURATION is an optional duration specified in seconds (defaults to 30).
#   - $PID is a specific process ID to profile; by default we profile all `node`
#     processes on the remote instance.
#
# It is also assumed that you have set up your AWS credentials so that the
# `aws` CLI functions correctly.
#
# This script currently assumes that you are profiling EC2 instances inside
# a private VPC and that you have set up a bastion tier with SSH forwarding
# appropriately. If this is not the case, simply change
# s/PrivateDnsName/PublicDnsName/.
#
##########

set -e

INSTANCE=$1
DURATION=${2:-30}
PID_MATCHER="echo $3"
if [ -z ${3+x} ]; then
    PID_MATCHER='pgrep node | xargs | sed -e "s\/ \/,\/g"'
fi
echo "building perf script..."
sed -e "s/DURATION/$DURATION/g" perf.sh > perf.tmp.sh
sed -i -e "s/PID_MATCHER/$PID_MATCHER/g" perf.tmp.sh

IP=$(aws ec2 describe-instances --instance-ids $INSTANCE --query "Reservations[*].Instances[*].PrivateDnsName" --output text)

# Collect a sample from the remote node
echo "running the debug script on the remote node: $INSTANCE"
ssh ec2-user@$IP 'bash -s' < perf.tmp.sh

# Cleanup after ourselves
rm -f perf.tmp.sh perf.tmp.sh-e

# Retrieve the svg from remote node
echo "copying the flamegraph back from the remote node..."
scp ec2-user@$IP:/home/ec2-user/flame.svg .

# Open the image in the browser
echo "opening the flamegraph in Chrome..."
open flame.svg -a Google\ Chrome

perf.sh

The second script is a template that runs on the remote instance. profile.sh embeds the duration and any PID-specific options in this script before scp'ing it to the remote EC2 instance. perf.sh downloads all the tools necessary to build the flamegraph, profiles the process(es) of interest, and then builds the flamegraph.

#!/bin/bash
#
# See: https://mixmax.com/blog/determining-why-that-server-is-on-fire for usage.
#
###########

echo "installing perf..."
sudo yum install -y perf

echo "fixing file perms..."
sudo chown root /tmp/perf-*.map

DUR=DURATION # this is replaced by profile.sh
PIDS=$(PID_MATCHER) # this is replaced by profile.sh
echo "collecting info for $DUR seconds..."
echo "collecting info on $PIDS..."
sudo perf record -F 99 -o perf.data -p $PIDS -g -- sleep $DUR

echo "running perf..."
sudo perf script -i perf.data > out.nodestacks01

echo "cloning flamegraph..."
git clone --depth 1 https://github.com/brendangregg/FlameGraph
cd FlameGraph

echo "creating flamegraph..."
./stackcollapse-perf.pl < ../out.nodestacks01 | ./flamegraph.pl > ../flame.svg
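As an aside, the odd-looking escaping in profile.sh's PID_MATCHER exists because that string passes through sed itself: the `\/` sequences survive bash's quoting so that sed's replacement emits literal slashes into the generated script. Here's the templating step in isolation (a sketch with hypothetical values; the real script uses two separate sed passes):

```shell
#!/bin/bash
# The two placeholders from perf.sh that profile.sh rewrites:
TEMPLATE='DUR=DURATION
PIDS=$(PID_MATCHER)'

DURATION=45
PID_MATCHER='pgrep node | xargs | sed -e "s\/ \/,\/g"'

# The same substitutions profile.sh performs before scp'ing the script over;
# sed turns each `\/` in the replacement into a literal `/`.
echo "$TEMPLATE" | sed -e "s/DURATION/$DURATION/g" -e "s/PID_MATCHER/$PID_MATCHER/g"
# -> DUR=45
# -> PIDS=$(pgrep node | xargs | sed -e "s/ /,/g")
```

Without the escaping, the slashes inside the inner sed expression would terminate profile.sh's own s/PID_MATCHER/.../ command early and break the substitution.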

How big is the fire?

Coincidentally enough, the night before I sat down to write this blog post we had an incident in one of our services. The service in question suddenly began to pin all of its CPUs. No code that could have caused the issue had been deployed recently, nor was there any unusual load on our system. So what could it be? The engineers on call used our profiling script to sample both an instance exhibiting the pinned-CPU behavior and a healthy one for comparison.

Unhealthy server

Healthy server

Putting out flames

Comparing the two graphs above immediately showed that the unhealthy server was spending an obscene amount of time parsing HTML. With that information, the engineers identified a few pieces of culprit data that were causing the behavior. They were then able to build a fix and hot-patch the bad data to resolve the issue.

Notes

You may notice in the SVGs above that there are many entries of the form /tmp/perf-$PID.map. This is because we use --perf_basic_prof_only_functions, which only outputs symbols for JS functions rather than all symbols (which the --perf_basic_prof flag outputs). We haven't found this to be a large issue in practice, however, as we still get enough information out of the flamegraphs for our purposes.
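If you're curious how noisy that makes a given profile, you can count the stacks that contain an unresolved map frame in the collapsed output. A sketch with hypothetical sample data (on a real run you'd grep the file produced by stackcollapse-perf.pl instead):

```shell
#!/bin/bash
# Hypothetical collapsed stacks; unresolved JIT frames show up as the raw
# map path rather than a JS function name.
cat > stacks.txt <<'EOF'
node;main;LazyCompile:~parse /srv/app/parser.js:10;/tmp/perf-1234.map 57
node;main;LazyCompile:~render /srv/app/views.js:3 12
EOF

# Count stack lines containing at least one unresolved frame.
grep -c '/tmp/perf-' stacks.txt   # -> 1
rm -f stacks.txt
```

If that ratio is high enough to obscure your hot paths, switching to the fuller --perf_basic_prof flag trades memory growth for more complete symbols.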

Enjoy building greater introspection into infrastructure? Drop us a line.

Written by

Trey Tacon
