Predictions put Sandy on a direct course toward our Virginia data center. Our preparations focused on ensuring full availability knowing that one of our data centers was about to get clobbered. We assumed the worst in terms of physical data center impact - that the power would go out for days, generators would fail, physical equipment would be damaged, and the data center would be shut off and temporarily abandoned. But regardless of what was going to happen, our maps needed to keep working and handle the incredible traffic increase during and after the storm.
MapBox runs hot in two data centers, one in Virginia and the other in Ireland, and our Dyn DNS fails over traffic should one go down. We assumed Virginia would go down and were not comfortable running only in Ireland for what could be several days. It was unclear what network traffic would look like if our only data center were in Ireland while the East coast was badly damaged. Based on the idea that Virginia would fail, we drew up the following plan on Friday:
Set up an additional instance of MapBox at AWS’s Oregon data center.
Make it hot by using Dyn’s Traffic Manager to direct all West coast traffic to Oregon.
Use DNS failover to send any traffic from VA to OR in the event of a data center outage.
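The routing behavior this plan aims for can be sketched in a few lines of Python. This is only an illustrative model, not Dyn's API or our actual configuration: the `PREFERENCES` table, the client area names, and the `route` function are all hypothetical, and the fallback order for Europe is an assumption made for the example.

```python
# Hypothetical model of geo-aware DNS routing with failover.
# This is NOT Dyn's API; all names and orderings are assumptions for illustration.

# Preference order per client area: the first healthy region wins.
# West coast traffic goes to Oregon, and Virginia fails over to Oregon (from the plan).
# The Europe ordering is assumed for the example.
PREFERENCES = {
    "us-west": ["oregon", "virginia", "ireland"],
    "us-east": ["virginia", "oregon", "ireland"],
    "europe": ["ireland", "virginia", "oregon"],
}


def route(client_area, healthy_regions):
    """Return the region that should serve a client, skipping unhealthy regions."""
    for region in PREFERENCES[client_area]:
        if region in healthy_regions:
            return region
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    all_up = {"virginia", "oregon", "ireland"}
    virginia_down = {"oregon", "ireland"}

    print(route("us-east", all_up))         # virginia
    print(route("us-east", virginia_down))  # oregon
    print(route("us-west", virginia_down))  # oregon
```

In this model, losing Virginia changes where East coast requests land without touching the West coast or European paths, which is the property the plan was after.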
When Sandy made landfall, MapBox was running in three data centers, handling 450% of normal traffic, and all traffic going to Virginia was ready to fail over to the remaining data centers.
Highly available architecture
Each instance of MapBox within a data center is also designed to be highly available based on the recommendations of AWS. Within each data center a full MapBox instance runs in at least two availability zones, which are basically like separate warehouses of servers, each with separate private networks, power, and internet uplinks. In other words, there are at least two instances of MapBox running in each data center where MapBox is deployed. The use of multiple availability zones does not guarantee that an infrastructure is 100% failure-proof. Even though each availability zone has a separate power source, network, and physical location, past issues show how one availability zone can affect another, depending on how the private network across availability zones is used.
These types of issues, along with events like the power outage this summer that could easily knock out multiple power supplies, are examples of why MapBox runs in multiple AWS regions. However, there are sometimes smaller outages, like last week’s event, where running in multiple availability zones can prevent a regional outage of your service. Our approach is to design knowing that core parts of our system will fail at some point. This level of persistent paranoia helps us avoid failure at as many levels as possible, and this is how we are prepared for both single availability zone and region-wide AWS failures.
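To make the two layers of redundancy concrete, here is a small Python sketch of the reasoning. It is a toy model under stated assumptions, not a description of our tooling: the `DEPLOYMENT` layout, the zone names, and the `surviving_regions` function are hypothetical, and each region is simply assumed to run a full instance in two availability zones.

```python
# Toy model of a multi-region, multi-availability-zone deployment.
# Region and zone names are hypothetical; each region is assumed to run
# a full instance in two zones, as described in the text.
DEPLOYMENT = {
    "virginia": {"zone-a", "zone-b"},
    "oregon": {"zone-a", "zone-b"},
    "ireland": {"zone-a", "zone-b"},
}


def surviving_regions(deployment, failed_zones=(), failed_regions=()):
    """Return the regions that still have at least one working zone.

    failed_zones is an iterable of (region, zone) pairs.
    failed_regions is an iterable of region names that are entirely down.
    """
    failed_zones = set(failed_zones)
    failed_regions = set(failed_regions)
    survivors = []
    for region, zones in deployment.items():
        if region in failed_regions:
            continue  # a region-wide failure takes out every zone in it
        live_zones = {z for z in zones if (region, z) not in failed_zones}
        if live_zones:
            survivors.append(region)
    return survivors


if __name__ == "__main__":
    # A single zone failing leaves all three regions serving traffic.
    print(surviving_regions(DEPLOYMENT, failed_zones=[("virginia", "zone-a")]))
    # A region-wide failure leaves the other two regions.
    print(surviving_regions(DEPLOYMENT, failed_regions=["virginia"]))
```

The model treats zone failures as independent, which the past issues mentioned above show is not always true in practice. That gap is exactly why the second layer, multiple regions, matters.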
Track our status
Whether an event is somewhat predictable like Sandy, or unpredictable like the earthquake that affected the East coast last summer, we will be open about how MapBox is designed to cope with such unfortunate occurrences, and we aim to serve your maps as fast as possible, no matter the circumstances. Find MapBox’s current status on the MapBox status page.
works with the Mapbox engineering team, where he focuses on managing performance, stability, availability, and security across our cloud infrastructure.
Follow @ianshward on Twitter