Monitoring a Scaling Infrastructure at Aviary

Serving photo editing tools, effects, frames, and stickers to millions of users around the globe requires an infrastructure that can grow with demand, and a tightly integrated platform to monitor it. At Aviary, we use Amazon’s large suite of tools along with a number of third-party and custom products to keep all our environments operating at peak performance and to catch potential issues before they affect our end users.

NAGIOS

At the core of our infrastructure monitoring platform is the well-known tool, Nagios. We run Nagios with a custom frontend known as Opsview, which provides a clean user interface for the management of hundreds of hosts, as well as an API which we use in our autoscaling environments. At its most basic level, Nagios is an infrastructure monitoring platform which can be endlessly configured to perform checks on various aspects of the infrastructure. At Aviary, a majority of these checks are performed against Amazon EC2 hosts and include disk and memory usage, CPU load, network traffic, and, in the case of hosts serving web pages, page response time. We have assorted other checks in place that test other infrastructure components such as the availability of S3 buckets, expiration dates for certificates, and DNS.
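
As a rough illustration of what one of these checks looks like, the sketch below follows the standard Nagios plugin convention: a check prints one line of status text and reports its result through its exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN). The function name `check_disk_usage` and the default thresholds of 80 and 90 percent are assumptions made for this example, not the checks Aviary actually runs.

```shell
#!/bin/bash

# Hypothetical helper: classify a disk usage percentage against warning and
# critical thresholds, using the Nagios plugin exit-code convention.
check_disk_usage() {
    local used="$1" warn="${2:-80}" crit="${3:-90}"

    # Anything that is not a whole number is reported as UNKNOWN (exit code 3)
    case "$used" in
        ''|*[!0-9]*) echo "DISK UNKNOWN - invalid usage value '$used'"; return 3 ;;
    esac

    if [ "$used" -ge "$crit" ]; then
        echo "DISK CRITICAL - ${used}% used"; return 2
    elif [ "$used" -ge "$warn" ]; then
        echo "DISK WARNING - ${used}% used"; return 1
    fi
    echo "DISK OK - ${used}% used"; return 0
}

# Example: check the root filesystem (df -P prints capacity as e.g. "42%")
# used=$(df -P / | awk 'NR==2 {gsub("%", "", $5); print $5}')
# check_disk_usage "$used" 80 90
```

Because the result is carried by the exit code, the same pattern applies to memory, load, or response-time checks; only the measurement and the thresholds change.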

AN AUTOSCALING CHALLENGE

While Nagios is a very configurable system, it does not include native support for adding new EC2 hosts automatically. This was a major issue for us, as more than half of our infrastructure scales up and down with new hosts as the systems’ usage patterns change. Additionally, we could not use some of the plugins developed to auto-add hosts, because our public IP address ranges are never predictable with autoscaling (VPCs could provide a useful solution here, but they also introduce other complexities). After looking into multiple Nagios competitors specifically designed “for the cloud,” we still could not find a match for the stability and feature set provided by Nagios, and instead began looking into ways to adapt it for autoscaling.

Through a customized version of Nagios known as Opsview, we were able to expose an API which hosts could use to add themselves when they first booted and remove themselves when they shut down. This solution, integrated with CloudFormation to perform the initial setup, means that every one of our EC2 instances is automatically configured for monitoring as soon as it boots.


We used CloudFormation to run the following script when each instance boots. Essentially, this script installs the Opsview agent, queries the EC2 instance metadata service for host information such as the hostname, public IP address, and instance ID, logs in to the Opsview server API with a username and password to obtain a token, and then sends a PUT request to the host configuration endpoint to add itself. Finally, it asks Opsview to reload so the new host takes effect.

#!/bin/bash

# Installing opsview requirements
sudo apt-get -y -q install libgetopt-mixed-perl libmcrypt4;

# Download the agent (saved as opsview.deb so the install step below finds it)
wget -O opsview.deb "https://s3.amazonaws.com/opsview-agents/opsview-agent[version].deb";

# Install the agent
sudo dpkg -i opsview.deb;
sudo service opsview-agent start;

# Opsview API
HOSTGROUP="Linux"
HOSTTEMPLATE="Linux Instance"

USERNAME="user"
PASSWORD="password"
HOST="https://monitoring-server.organization.com"
NAME=$(curl -ss http://169.254.169.254/latest/meta-data/public-hostname)

# Get public IP address for use in monitoring
IP=$(curl -ss http://169.254.169.254/latest/meta-data/public-ipv4)

# Get the instance ID
INSTANCE_ID=$(curl -ss http://169.254.169.254/latest/meta-data/instance-id)

# Login to the Opsview API and receive a token
TOKEN=$(curl -k -ss --data "username=$USERNAME&password=$PASSWORD" $HOST/rest/login | sed -e 's/^.*"token":"\([^"]*\)".*$/\1/')

# Add host to the host lists via CURL
curl -k -ss -X PUT -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $TOKEN" -H "Content-Type: application/json" -d '{"name": "'"$NAME"'","ip": "'"$IP"'","hostgroup": {"name": "'"$HOSTGROUP"'"},"hosttemplates": [{"name":"'"$HOSTTEMPLATE"'"}], "icon": {"name": "SYMBOL - Network Cloud"}, "hostattributes": [{"name":"AWS_INSTANCE","value":"'"$NAME"'"}, {"name":"AWS_INSTANCE_ID","value":"'"$INSTANCE_ID"'"}]}' $HOST/rest/config/host/

# Reload the engine
curl -k -X POST -H "X-Opsview-Username: $USERNAME" -H "X-Opsview-Token: $TOKEN" -H "Content-Type: application/json" $HOST/rest/reload

Nagios / Opsview Dashboard Showing Healthy Host Checks
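
The two most fragile parts of the boot script are pulling the token out of the login response and assembling the host JSON with nested shell quotes. One way to make both easier to read and test is to wrap them in small functions, sketched below. The function names `extract_token` and `build_host_payload` are hypothetical; the sed expression and the JSON fields are the ones used in the script above.

```shell
#!/bin/bash

# Pull the token out of a login response read from stdin, using the same sed
# expression as the boot script. Assumes the response has a "token":"..." pair.
extract_token() {
    sed -e 's/^.*"token":"\([^"]*\)".*$/\1/'
}

# Build the host JSON with printf instead of inline quote juggling.
# Arguments: name, ip, hostgroup, hosttemplate, instance id.
# Values are inserted verbatim, so they must not contain quotes or backslashes.
build_host_payload() {
    printf '{"name":"%s","ip":"%s","hostgroup":{"name":"%s"},' "$1" "$2" "$3"
    printf '"hosttemplates":[{"name":"%s"}],' "$4"
    printf '"icon":{"name":"SYMBOL - Network Cloud"},'
    printf '"hostattributes":[{"name":"AWS_INSTANCE","value":"%s"},' "$1"
    printf '{"name":"AWS_INSTANCE_ID","value":"%s"}]}\n' "$5"
}

# Possible use inside the boot script:
# TOKEN=$(curl -k -ss --data "username=$USERNAME&password=$PASSWORD" "$HOST/rest/login" | extract_token)
# PAYLOAD=$(build_host_payload "$NAME" "$IP" "$HOSTGROUP" "$HOSTTEMPLATE" "$INSTANCE_ID")
```

Keeping the payload in one function also makes it easy to print and inspect the JSON before it is sent to the monitoring server.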

MORE IN-DEPTH STATISTICS

Together, Nagios and Opsview provide a very manageable dashboard for quickly determining the state of all our services. However, because they are primarily based on active checks, meaning the monitoring server sends a query to the host to check the status of a service, obtaining detailed statistics at a minute or second level is inefficient. For this reason, we have also begun to use two other data collection tools, StatsD and CollectD, to pipe a steady stream of data to a centralized logging server.


By again using CloudFormation, we added steps to the script above that install the CollectD and StatsD agents on the host and configure them to send data to a centralized server at regular intervals. CollectD collects system metrics such as CPU or disk usage, while StatsD is used within applications to send incremental (counter) or time-based metrics. On the server, we installed Graphite, a tool which collects, aggregates, and displays the collected statistics using charts and graphs. With Grafana, a replacement frontend for Graphite, the resulting metrics page is both clear and good-looking.
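
To make the StatsD side more concrete, here is a minimal sketch of how an application-side metric can be emitted from a shell script. StatsD accepts plain-text lines of the form `<name>:<value>|<type>` over UDP, where the type is `c` for a counter, `ms` for a timer, or `g` for a gauge. The server name, the environment variables `STATSD_HOST` and `STATSD_PORT`, and the metric names are placeholders invented for this example.

```shell
#!/bin/bash

# Format one metric in the StatsD line format: <name>:<value>|<type>
# Common types: c (counter), ms (timer, in milliseconds), g (gauge).
statsd_line() {
    printf '%s:%s|%s' "$1" "$2" "$3"
}

# Send a metric over UDP using bash's /dev/udp pseudo-device.
# Host and port are placeholders; 8125 is the usual StatsD default port.
statsd_send() {
    local host="${STATSD_HOST:-metrics.example.com}" port="${STATSD_PORT:-8125}"
    statsd_line "$1" "$2" "$3" > "/dev/udp/$host/$port"
}

# Example (hypothetical metric names):
# statsd_send "photos.filter.applied" 1 c      # increment a counter
# statsd_send "photos.render.time" 320 ms      # record a timing
```

Because the metrics travel over UDP, the sender does not wait for a reply, which is what makes this push model cheaper than an active check for high-frequency data.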

Graphite / Grafana Dashboard with Second-Level Statistics


This dashboard allows us to zoom in on second-level statistics to help narrow down memory or CPU spikes, like the one shown above.

All of the instances are configured to automatically begin sending statistics, and Graphite adds them automatically as well, allowing us to easily monitor autoscaling instances.
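
New instances show up without any registration step because Graphite creates a metric series the first time it receives a data point for a path it has not seen before. The sketch below shows what such a data point looks like in Graphite's plaintext protocol, one line of `<metric.path> <value> <unix timestamp>`. The `servers.` prefix, the function names, and the server name are assumptions for illustration; in our setup the CollectD and StatsD agents produce these values rather than a hand-written script.

```shell
#!/bin/bash

# Turn a hostname into a single Graphite path component by replacing dots,
# which Graphite treats as hierarchy separators.
graphite_host_key() {
    printf '%s' "$1" | tr '.' '_'
}

# Format one data point in Graphite's plaintext protocol:
#   <metric.path> <value> <unix timestamp>
# Arguments: hostname, metric name, value, timestamp.
graphite_line() {
    local host_key
    host_key=$(graphite_host_key "$1")
    printf 'servers.%s.%s %s %s\n' "$host_key" "$2" "$3" "$4"
}

# Example: report a load average for this host (server name is a placeholder;
# 2003 is the usual port for Graphite's plaintext listener).
# graphite_line "$(hostname)" load.shortterm 0.42 "$(date +%s)" \
#     > /dev/tcp/graphite.example.com/2003
```

Since each autoscaled host reports under its own path component, a wildcard in that position of a Graphite query can match every instance in a group, including ones that booted a few minutes ago.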

ALERTING

All of this information would be virtually useless if we could not receive notifications when predefined thresholds were passed. Fortunately, Opsview integrates seamlessly with PagerDuty, a third-party alerting service, to send us email, voice, and text alerts when a particular threshold is triggered. We can then use Graphite to narrow down the issue to a very precise time range, which assists drastically with troubleshooting.

CONCLUSION

Given the high impact of downtime, monitoring should no longer be an afterthought. By integrating it directly into the setup and shutdown processes, even complex autoscaling environments can be monitored with very little effort. Monitoring also does not need to be costly. Although there are hundreds of premium tools and services that provide a wide range of monitoring solutions, everything in Aviary’s platform is accomplished entirely for free (minus the cost of AWS hosting for the monitoring servers). With the insights and performance statistics we have gathered, we are able to continually improve our entire infrastructure to better serve our users.

Questions? Feel free to contact the post’s author, Matt Fuller, at matt@aviary.com.