MapBox is powered by open source software, primarily Node.js, Backbone.js, Puppet, and Jekyll, and most of our stack is deployed on Amazon's high-performance cloud infrastructure. This post describes MapBox's infrastructure and how we've overcome the challenges of hosting worldwide maps at scale to build an efficient platform.
MapBox uses cloud services so we can scale quickly with demand and avoid centralization. At any time, we're running a cluster of EC2 instances as our primary application servers, and we can add more within minutes if needed. An Elastic Load Balancer divides traffic among the running servers and routes around any that become unresponsive.
We use CloudFront as our CDN to distribute and cache tiles, interactivity grids, map embeds, and API request payloads in over twenty datacenters around the world so maps load quickly for everyone.
The custom maps that you design on your computer using TileMill are exported as MBTiles files and uploaded through S3 before being propagated to the application servers for permanent storage. EBS volumes in RAID configuration are attached to each application server and are responsible for housing the MBTiles files.
We use CloudWatch for monitoring and SES for transactional email. Finally, we use CloudFormation to manage these services, which allows us to quickly turn on new stacks for both staging and production purposes.
Inside our application servers
We have multiple application servers in operation at any given time. Each server is running a variety of processes, but the essential ones are described below.
The MapBox process
This is the primary custom Node.js application that serves as the heart and soul of our service. The server process uses a separate port to handle requests against the MapBox API: requests for tiles, embeds, and metadata. This way we can cache these longer-lived responses in the CDN and keep them super fast.
We can serve tiles fast because they're stored efficiently on local EBS volumes attached to each application server in the form of SQLite database files, which we have standardized as the MBTiles format.
Each tile server has disk-level access to the entire collection of MBTiles files. A robust sqlite3 library allows performance to rival other database formats and surpass the speeds seen when loading individual map tiles off a standard filesystem. With MBTiles we can also dramatically improve the performance of uploading, relocating, and deleting tiles compared to filesystems, which typically have very high per-tile overhead.
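As a sketch of how a single tile lookup against an MBTiles file works: the MBTiles `tiles` table stores rows in TMS order, so the y coordinate of a standard XYZ request has to be flipped before querying. The function below is illustrative, not our exact server code; the actual query would run through a sqlite3 binding.

```javascript
// Map XYZ tile coordinates to the MBTiles schema. MBTiles stores
// tile_row in TMS order, so y must be flipped: row = 2^z - 1 - y.
function xyzToMbtiles(z, x, y) {
  return { zoom_level: z, tile_column: x, tile_row: Math.pow(2, z) - 1 - y };
}

// The SQL a tile server would run against the MBTiles SQLite file.
var TILE_SQL =
  'SELECT tile_data FROM tiles ' +
  'WHERE zoom_level = ? AND tile_column = ? AND tile_row = ?';

// Example: the XYZ tile z=2, x=1, y=0 lives at TMS row 3.
var coords = xyzToMbtiles(2, 1, 0);
console.log(coords.tile_row); // 3
```

Because every tile read is an indexed SQLite query against one file, there is no per-tile filesystem overhead, which is also why bulk uploads and deletions are so much cheaper.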
The MapBox tile server is also responsible for our powerful compositing features that allow you to take multiple layers and combine them.
This makes custom styles for MapBox Streets possible and results in maps that load significantly faster by reducing the number of HTTP requests and minimizing the amount of pixel information that is transferred. The results of composited tile requests are cached in our CDN as distinct tiles so subsequent requests are even faster.
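One way to make composited tiles CDN-cacheable is to address each combination of layers with a single canonical URL, so the CDN treats the combined result as one distinct tile. The sketch below is illustrative (the path scheme and layer names are assumptions, not our exact request format); layer order is preserved because layers draw bottom to top.

```javascript
// Sketch: build one canonical URL for a composite of several layers
// so the CDN caches the combined tile as a single object. The /v3/
// path scheme and layer names here are illustrative assumptions.
function compositeTileUrl(layers, z, x, y) {
  // Order matters: layers are composited bottom to top, so the
  // list is joined as-is rather than sorted.
  return '/v3/' + layers.join(',') + '/' + z + '/' + x + '/' + y + '.png';
}

console.log(compositeTileUrl(['examples.streets', 'examples.overlay'], 4, 3, 5));
// /v3/examples.streets,examples.overlay/4/3/5.png
```

A client then makes one HTTP request per composited tile instead of one per layer, which is where the reduction in requests and transferred pixel data comes from.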
The other port of this Node.js process serves application pages to our users, including the map page, login page, analytics pages, and the map builder.
Both parts of the application are built with Bones, a client/server application framework for Node.js that uses Express and Backbone.js. It allows you to set up MVC structures once and reuse them both on the server and the client.
We store a variety of documents in CouchDB databases that are replicated among all instances in the cluster. These documents include users, sessions, tile request analytics, and map metadata.
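Setting up that kind of sync comes down to POSTing a document to CouchDB's `_replicate` endpoint. A sketch of building that request body (the host names and database are hypothetical; the actual HTTP call is omitted):

```javascript
// Sketch: build the body for CouchDB's /_replicate endpoint to keep
// a database in sync between two instances. Hosts are made up.
function replicationDoc(sourceHost, targetHost, db) {
  return {
    source: 'http://' + sourceHost + ':5984/' + db,
    target: 'http://' + targetHost + ':5984/' + db,
    continuous: true // keep replicating as new documents arrive
  };
}

console.log(replicationDoc('app-1.internal', 'app-2.internal', 'analytics'));
```

Running a pair of these in each direction between instances keeps every node's copy of users, sessions, and analytics converging on the same state.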
Nginx is used as a reverse proxy cache and for SSL termination. Certain images and pages are cached for anonymous users. Authenticated traffic on mapbox.com is handled over SSL, and we use Nginx to handle these connections.
Puppet helps manage our server configuration: a micro EC2 instance runs as a dedicated 'puppet master' from which application servers pull configuration updates.
A custom Node.js application reports metrics to CloudWatch. It is invoked regularly via cron, collects metrics about the health of the instance, and reports the figures to CloudWatch. We can then configure alarms based on those metrics using the AWS console.
CDN log processor
Some non-critical processes only run on one application server, deemed "the supernode". If the supernode fails, another instance in the cluster will automatically take on that role and spawn these unique processes. For instance, the supernode runs our log processor, which downloads tile access logs and feeds them to CouchDB to power our analytics pages. This task tolerates failures because a newly crowned supernode can "catch up" with new logs if the previous node dies.
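One simple way to implement a role like this (the election rule below is an assumption, not necessarily what our cluster uses): every instance applies the same deterministic rule to the shared member list, so all nodes agree on the supernode without extra coordination, and when the holder drops out of the list the rule elects a replacement automatically.

```javascript
// Sketch: deterministic supernode election. Every instance applies
// the same rule to the cluster member list; here the lowest
// instance id wins. When that node disappears from the list, the
// survivors all agree on the next one.
function electSupernode(instanceIds) {
  if (instanceIds.length === 0) return null;
  return instanceIds.slice().sort()[0];
}

var cluster = ['i-0c4f', 'i-09a1', 'i-1b22'];
console.log(electSupernode(cluster)); // i-09a1

// The elected node dies; the survivors converge on a replacement.
console.log(electSupernode(cluster.filter(function (id) {
  return id !== 'i-09a1';
}))); // i-0c4f
```

Because the rule is a pure function of the member list, no leader handoff protocol is needed; a freshly elected node just starts the supernode processes and catches up on unprocessed logs.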
Monitoring, logging, and alerts
We use CloudWatch and Server Density for monitoring, Loggly for log analysis, and PagerDuty to send incident alerts and manage on-call schedules.
CloudWatch is responsible for monitoring system-level metrics such as resource utilization (CPU, disk, memory), as well as custom metrics that we report to CloudWatch using its API. For example, we check every minute that our billing system is in sync with the application. If an interruption results in inconsistent data, an alarm is triggered in CloudWatch, which dispatches a notification to the on-call engineer.
Server Density makes regular requests against multiple endpoints on each server and evaluates the response times. If a server doesn’t respond fast enough, an alarm is triggered.
Loggly allows us to quickly search through log entries across the entire hosting system. We are running the same software on multiple machines, and Loggly makes it incredibly easy to figure out where an error originated without having to log in to each machine directly. We also have a few alerts configured in Loggly. Errors that occur infrequently are difficult to track down, but we can set up an alert and receive a notification as soon as one occurs.
Any time an alarm is triggered in CloudWatch, Server Density, or Loggly, it is sent into PagerDuty, a service that allows us to track incidents and send notifications to whoever is on-call. We have three engineers on-call 24 hours a day. PagerDuty will notify one engineer first and will automatically escalate to the other two if the primary engineer can’t respond to the situation immediately.
Development and deployment strategies
We develop MapBox entirely on our laptops: development never happens on staging server stacks.
This preference affects what tools we use. We look for software, libraries, and services that can easily be set up on a single computer so a new developer can get up to speed quickly. We also focus on making application bootstrap automatic, so most setup tasks are handled without manual intervention when a new developer starts the project for the first time.
We generally follow the GitHub approach to deployment where our master branch is always considered “deployable”. This means we never commit unstable code to the master branch, as it may be deployed at any time by any developer (often multiple times a day).
Instead of using Campfire and Hubot for kicking off a deploy like GitHub does, we wrote a simple bash script. You pass in a few arguments, including the destination of the deploy (staging or production), and it connects to each instance (including the puppet master) and stages a clean build of the latest master branch. If the build completes on every server, the build is switched into production. It may not be as feature-complete as tools like Capistrano, but it's incredibly simple and easy to understand.
This overview should give you a sense of how MapBox is engineered to serve custom maps quickly. Our benchmarks have shown that this architecture is able to scale incredibly well. Look forward to more posts about specific technology as well as updates to this post as our service evolves.