How to utilize Mapbox Movement data for mobility insights

Virginia Ng

Mar 4, 2021

A guide for analysts, data scientists, and developers

OVERVIEW

Mapbox Movement is the world’s most comprehensive privacy-forward location dataset. It provides a high-definition view of how, where, and when people move through a city or specific geography. Over the past few months, we’ve used Mapbox Movement data to look at activity during the peak of pandemic shutdown in early April, the 2021 United States Presidential Inauguration, and the re-opening and recovery process for retail commerce.

Late last year, we released a full year of the Mapbox Movement data in the US, Germany, and Great Britain, available for free to any Mapbox Developer here. In this blog post, we will walk through a real-life example analysis of this data. We will show you how to access this data, how to visualize it (no code experience needed!), and how to analyze activity at various business locations (in our example, major airports).

Chart 1: Normalized aggregated activity at San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

At the end of this tutorial, you will be able to apply these methods to analyze any number of industries and problem spaces and derive your own insights, similar to the ones in our retail recovery analysis. All steps in this tutorial, including the map visualizations, can be done in a single Python Jupyter notebook. You can find the public notebook and more details about the code on GitHub and on AWS S3.

WHAT WE’RE GOING TO DO

First, we are going to take a quick look at Mapbox Movement sample data to understand how to access and visualize this data with little to no code. Next, we will use the coordinates and some attributes from OurAirports’ latest airport dataset to find interesting airports to run our analysis on. Based on these coordinates, we will use some geospatial tools to help us draw approximate boundaries for each airport. We will then analyze and compare the spatial and temporal movement trends within a few of these airport boundaries over the course of 2020.

WHAT IS MAPBOX ACTIVITY INDEX?

Mapbox Movement provides measures of relative human activity (walking, driving, etc) over both space and time, provided as an activity index. The activity index is a Mapbox proprietary metric that reflects the level of activity in the specified time span and geographic region. This index is calculated by aggregating anonymized location data from mobile devices by day or month, as well as by geographic tiles of approximately 100-meter resolution. These geographic tiles are based on quadkeys—a numeric reference to a geographic partition at a specified OpenStreetMap zoom level. The activity per day and tile is normalized relative to a baseline, where the baseline represents activity in the 99.9th percentile day/tile of January 2020 in each country. In other words, a city block that had the most activity in January 2020 in the US would have an activity index value of around 1, while a city block that had half that activity on a day in March would have an activity index value of 0.5. See the Movement documentation for more details on how the activity index is calculated.
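For intuition, the normalization described above can be sketched with made-up numbers (the baseline value and daily activities below are hypothetical, not real Movement data):

```python
# Hypothetical numbers illustrating the activity index normalization.
# `baseline` stands in for the 99.9th-percentile day/tile of January 2020.
baseline = 2000.0
raw_activity = {
    "2020-01-15": 1900.0,  # a near-baseline winter day
    "2020-03-20": 1000.0,  # a day with half the baseline activity
}

# Each day/tile's raw activity is divided by the baseline
activity_index = {day: raw / baseline for day, raw in raw_activity.items()}
print(activity_index)  # {'2020-01-15': 0.95, '2020-03-20': 0.5}
```

A tile-day with half the baseline activity yields an index of 0.5, matching the city-block example above.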

A QUICK AND EASY WAY TO VISUALIZE ACTIVITY DATA WITH NO CODE!

A picture is worth a thousand words, especially when we need to present results to our teams. It's important to know how to quickly visualize the data and get an understanding of the Mapbox Movement dataset.

Click on this link to download a sample CSV file consisting of January 1st, 2020 Movement data covering the San Francisco Bay Area.


The CSV file is an example of activity index aggregated by day, by quadkey: 

  • The bounds field provides the outer corner coordinates of each quadkey (roughly the size of a city block), while the xlat and xlon fields show the centroid of each quadkey.
  • The agg_day_period field should show 2020-01-01 in this example.
  • The most important field is the activity_index_total, which tells us the normalized activity for each city block. 
  • Each quadkey in this CSV file has already been converted to latitude and longitude coordinates. All we need to do is drag and drop this CSV file into Kepler.gl to visualize the activity index. 
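Before opening Kepler.gl, we can also peek at this layout with pandas. The two rows below are toy values in the same pipe-delimited layout as the sample file (the real file has many more rows, and the values here are illustrative only):

```python
from io import StringIO
import pandas as pd

# Two toy rows mimicking the pipe-delimited layout of the sample CSV
# (values are made up, not real Movement data).
raw = (
    "agg_day_period|activity_index_total|geography|xlon|xlat|bounds\n"
    "2020-01-01|0.42|023010212330212331|-122.39|37.62|POLYGON((...))\n"
    "2020-01-01|0.13|023010212330212333|-122.38|37.61|POLYGON((...))\n"
)
df = pd.read_csv(StringIO(raw), sep="|")
print(df[["agg_day_period", "activity_index_total"]])
```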

Each point in this map represents a quadkey. We can hover over each point to view its activity level, centroid, and date. We can also zoom in and out of the map to observe micro versus macro activity patterns:

If you would like to access a full year of Mapbox Movement sample data in the US, Germany, and Great Britain, submit a sample request and follow the instructions to download the data. Continue with the tutorial to learn how to download Movement data and analyze activity at specific geographic locations.

GETTING SET UP

  1. Have access to Mapbox Movement sample dataset (Request access to the Movement sample dataset for free).
  2. Download the file named airports.csv from OurAirports’ latest Airports dataset. We will be using the following attributes from this dataset: type, name, latitude_deg, longitude_deg, iso_country, iso_region, municipality, iata_code.
  3. Have the following list of Python dependencies installed in your virtual environment. My virtual environment and Jupyter notebook are running on Python 3.7.2.
  4. (Optional) Download the Jupyter notebook from GitHub or AWS S3 to follow along. Note that you will need to have valid AWS credentials set before starting the notebook server, in the same terminal window you run the Jupyter notebook from.

VISUALIZING MOVEMENT DATA IN A JUPYTER NOTEBOOK

Besides visualizing Mapbox Movement data in the web-based version of Kepler.gl, we can also load the sample CSV file consisting of January 1st, 2020 San Francisco Bay Area Movement data in a Jupyter notebook. Make sure you have the Jupyter version of kepler.gl installed in your virtual environment. You can use the following function, download_csv_to_df(), to download the sample CSV file and read it into a DataFrame:


import os
import requests
from io import StringIO
import pandas as pd

def download_csv_to_df(url, parse_dates=False):
    """
    Given a url, downloads and reads the data to a Pandas DataFrame.
    
    Args:
        url (string): includes a protocol (http), a hostname
            (www.example.com), and a file name (index.html).
        parse_dates (boolean or list): boolean or list of column names (str),
            default False.
            
    Raises:
        requests.exceptions.RequestException: An exception raised while
            handling your request.
        
    Returns:
        a Pandas DataFrame.
    """
    try:
        res = requests.get(url)
        
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)

    return pd.read_csv(StringIO(res.text), sep='|', parse_dates=parse_dates)

Paste the following commands into a Jupyter notebook cell, hit Run, and watch the Movement data get loaded into a kepler.gl map like the one you see below. You will see further down in the tutorial that this is a handy tool to use when we’re drawing and visualizing boundaries around each airport.

If you are using JupyterLab instead of the classic Jupyter notebook, you may run into issues with loading the kepler.gl widget. The workaround is to save the map as an interactive HTML file with the .save_to_html() method. You can then interact with this HTML file in your favorite web browser.


from keplergl import KeplerGl

# Download the sample CSV file consisting of January 1st, 2020 Movement data covering the San Francisco Bay Area.
movement_url = 'https://mapbox-movement-public-sample-dataset.s3.amazonaws.com/v0.2/daily-24h/v0.1.2/US/quadkey/total/2020/01/01/data/0230102.csv'
movement_df = download_csv_to_df(movement_url)

# Rename a few columns for clarity:
movement_df = movement_df.rename(
    columns={"xlon": "lon", "xlat": "lat"}
    )[["lon", "lat", "activity_index_total"]]

# Load an empty keplergl map
sample_z7_map = KeplerGl()

# Add movement sample csv. Kepler accepts CSV, GeoJSON or DataFrame!
sample_z7_map.add_data(data=movement_df, name='data_2')

# Optional: Save map to an html file
sample_z7_map.save_to_html(file_name='sample_z7_map.html')

# Load kepler.gl widget below a cell
sample_z7_map

Kepler.gl map loaded in a Jupyter notebook cell to visualize San Francisco Bay activity on January 1, 2020.

AIRPORT ACTIVITY ANALYSIS

Let's begin by finding the geolocation of the airports we want to study. First, use the previously defined function download_csv_to_df() to download the file airports.csv from the latest dataset by OurAirports and convert the data into a Pandas DataFrame:


# Download airport dataset
airport_url = "https://ourairports.com/data/airports.csv"
airports_df = download_csv_to_df(airport_url)

We will then drop all unnecessary columns and filter the DataFrame to only include major US airports:


# Drop unnecessary columns
airports_df = airports_df.drop(columns=['id', 'ident', 'elevation_ft','continent',  'scheduled_service', 'gps_code', 'local_code', 'home_link', 'wikipedia_link', 'keywords'])

# Filter to US only airports
us_airports_df = airports_df[airports_df["iso_country"] == 'US']

# Filter dataframe to include only large airports
us_airports_df_large = us_airports_df[us_airports_df["type"] == "large_airport" ]
us_airports_df_large = us_airports_df_large.sort_values(by=["iso_region"])

# Inspect the first few rows of the DataFrame
us_airports_df_large.head()

Here are the first five rows of this DataFrame:

First five rows of the us_airports_df_large DataFrame

In order to aggregate and analyze the activity associated with each airport, we will need to define the boundaries for each airport. There are close to 25,000 US-based airports in this dataset, 170 of which are categorized as "large airports". We could manually draw the boundaries for each, but this would be time-consuming. A more efficient way is to use geospatial libraries like Shapely and Mercantile to help us draw approximate boundaries at scale.

Let’s first convert each airport’s latitude and longitude pair into a Shapely Point object, which represents a single point in space:


from shapely.geometry import Point

def coords2points(df_row):
    """
    Converts "longitude_deg" and "latitude_deg" to Shapely Point object.
  
    Args:
        df_row (Pandas DataFrame row): a particular airport (row) or a pandas.core.series.Series
            from the original airports DataFrame.
    
    Returns:
        A shapely.geometry.point object.
  """
    return Point(df_row['longitude_deg'], df_row['latitude_deg'])

# Apply the coords2points function to each row and store results in a new column 
us_airports_df_large['airport_center'] = us_airports_df_large.apply(coords2points, axis=1)

Next, we will create a circle with a radius of 1,000 meters around each airport's Shapely Point object via Shapely's buffer() method. This circle will serve as our "approximate boundary" for every airport. It is stored as a Shapely Polygon object. Because we will be calculating circles on the surface of the Earth, this process involves a bit of projection magic. For more information about why we need to transform the coordinates back and forth between AEQD projection and WGS84, please see section I in the Appendix.


from functools import partial

import pyproj
from shapely import geometry
from shapely.geometry import Point
from shapely.ops import transform

def aeqd_reproj_buffer(center, radius=1000):
    """
    Converts center coordinates to AEQD projection,
    draws a circle of given radius around the center coordinates,
    converts both polygons back to WGS84.
    
    Args:
        center (shapely.geometry Point): center coordinates of a circle.
        radius (integer): circle's radius in meters.
    
    Returns:
        A shapely.geometry Polygon object for cirlce of given radius.
    """
    # Get the latitude, longitude of the center coordinates
    lat = center.y
    long = center.x
    
    # Define the projections
    local_azimuthal_projection = "+proj=aeqd +R=6371000 +units=m +lat_0={} +lon_0={}".format(
        lat, long
    )
    wgs84_to_aeqd = partial(
        pyproj.transform,
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
        pyproj.Proj(local_azimuthal_projection),
    )
    aeqd_to_wgs84 = partial(
        pyproj.transform,
        pyproj.Proj(local_azimuthal_projection),
        pyproj.Proj("+proj=longlat +datum=WGS84 +no_defs"),
    )

    # Transform the center coordinates from WGS84 to AEQD
    point_transformed = transform(wgs84_to_aeqd, center)
    buffer = point_transformed.buffer(radius)
    
    # Get the polygon with lat lon coordinates
    circle_poly = transform(aeqd_to_wgs84, buffer)
    
    return  circle_poly

Repeat this step for all airports by applying the aeqd_reproj_buffer() function above to all rows in the us_airports_df_large DataFrame:


# Map the aeqd_reproj_buffer function to all airport coordinates.
us_airports_df_large["aeqd_reproj_circle"] = us_airports_df_large["airport_center"].apply(aeqd_reproj_buffer)

We can then use kepler.gl to visually inspect the center coordinates and 1 km circle around any airport.


from shapely.geometry import mapping
from keplergl import KeplerGl

def plot_circles_kepler(airport, to_html=False):
    """
    Plots the center and 1 km circle polygons using kepler.

    Args:
        airport (pandas.core.series.Series): a specific airport (row) from the 
            main airports DataFrame.
        to_html (boolean): if True, saves map to an html file named
            {airport_iata_code}_circle.html.
    
    Returns:
        a kepler interactive map object.
    """
    # Get the center (Shapely Point), circle (Shapely Polygon), and iata code of the airport
    circle_center = airport["airport_center"]
    circle_poly = airport["aeqd_reproj_circle"]
    iata = airport["iata_code"]

    # Define the GeoJSON object for the 1 km circle
    circle_feature = {
        "type": "Feature",
        "properties": {"name": "Circle"},
        "geometry": mapping(circle_poly)
    }

    # Define the GeoJSON object for the center of the circle
    center_feature = {
        "type": "Feature",
        "properties": {"name": "Center"},
        "geometry": mapping(circle_center)
    }
    
    # Define the Kepler map configurations
    config = {
        'version': 'v1',
        'config': {
            'mapState': {
                'latitude': airport["latitude_deg"],
                'longitude': airport["longitude_deg"],
                'zoom': 12
            }
        }
    }

    # Load the keplergl map of the circle and add center
    circle_map = KeplerGl(data={'Circle': circle_feature})
    circle_map.add_data(data=center_feature, name='Center')
    circle_map.config = config
    
    # Optional: Save map to an html file
    if to_html:
        circle_map.save_to_html(file_name=f'{iata}_circle.html')

    return circle_map

For example, run the following code snippet to visualize the area around San Francisco International Airport (SFO):


# Plot 1 km circle polygon around San Francisco International Airport
sfo = us_airports_df_large[us_airports_df_large["iata_code"] == "SFO"].iloc[0]
plot_circles_kepler(sfo)

kepler.gl provides a set of Mapbox basemap styles and allows us to easily change the background map. Open the Base Map Panel and try different background map styles from a list of default map styles. Here I've selected Mapbox Satellite:

Mapbox Movement quadkey data are aggregated at zoom 18, which is roughly the size of a city block (approximately 100 meter resolution). You can use What the Tile interactive tool to visualize quadkey tiles of various sizes. For example, here are some of the zoom 18 quadkeys that overlap with the 1 km circle we drew at San Francisco International Airport (SFO):

Examples of the zoom 18 quadkeys that overlap with the 1 km circle at San Francisco International Airport (SFO).

The zoom 18 activity data are stored in CSV files, with each file containing data for a single zoom 7 tile. We can use What The Tile to look up the zoom 7 quadkey number for the area containing SFO airport; in this case, the zoom 7 number is 0230102. (Incidentally, the zoom 7 number is just the first 7 digits of any zoom 18 number that covers the airport.) The zoom 7 tile covers a much larger region of the San Francisco Bay Area:

Zoom 7 quadkey that overlaps with the 1 km circle we drew at San Francisco International Airport (SFO)

The January 1st daily activity data for the zoom 7 tile containing SFO and the rest of the Bay Area is stored under the following path in AWS S3: s3://mapbox-movement-public-sample-dataset/v0.2/daily-24h/v0.1.2/US/quadkey/total/2020/01/01/data/0230102.csv.

A more efficient way is to use the generate_quadkeys() method below to find all quadkey tiles that lie within a 1 km radius of each airport for a given zoom level. When we input the circle Shapely Polygon object we created for SFO and 7 as the zoom, the function returns the zoom 7 quadkey 0230102. When we input the same polygon and 18 as the zoom, the function returns the full list of zoom 18 quadkeys, matching the ones you saw earlier in What The Tile. Note that some airports' polygons span more than a single z7 quadkey.


import mercantile

def generate_quadkeys(circle_poly, zoom):
    """
    Generate a list of quadkeys that overlap with an airport.
    
    Args:
        circle_poly (shapely.geometry Polygon): circle polygon object drawn 
            around an airport.
        zoom (integer): zoom level.
        
    Returns:
        List of quadkeys as strings.
    """

    return [mercantile.quadkey(x) for x in mercantile.tiles(*circle_poly.bounds, zoom)]

We can use this method to quickly generate zoom 18 and zoom 7 quadkeys for all airports. Recall that we've generated 1 km circle polygons for each airport and stored them in the aeqd_reproj_circle column of the us_airports_df_large DataFrame.


# Create a list of overlapping z18 quadkeys for each airport and add to a new column
us_airports_df_large['z18_quadkeys'] = us_airports_df_large.apply(lambda x: generate_quadkeys(x['aeqd_reproj_circle'], 18),axis=1)

# Create a list of overlapping z7 quadkeys for each airport and add to a new column
us_airports_df_large['z7_quadkeys'] = us_airports_df_large.apply(lambda x: generate_quadkeys(x['aeqd_reproj_circle'], 7),axis=1)

Once we have these zoom 18 and zoom 7 quadkeys, we are now ready to download some activity data from the Mapbox Movement sample dataset. The download_sample_data_to_df() function below lets us download the raw activity data and create a Pandas DataFrame spanning all dates between a given start and end date.


import os
import requests
import pandas as pd
from io import StringIO
from datetime import datetime, timedelta

def download_sample_data_to_df(start_date, end_date, z7_quadkey_list, local_dir, verbose=True):
    """
    Downloads Movement z7 quadkey CSV files to a local dir and reads
    them into a Pandas DataFrame. This DataFrame contains
    **ALL** Z18 quadkey data in this Z7 quadkey.
    
    Args:
        start_date (string): start date as YYYY-MM-DD string.
        end_date (string): end date as YYYY-MM-DD string.
        z7_quadkey_list (list): list of zoom 7 quadkeys as string.
        local_dir (string): local directory to store downloaded sample data.
        verbose (boolean): print download status.
    
    Raises:
        requests.exceptions.RequestException: An exception raised while
            handling your request.
        
    Returns:
        a Pandas DataFrame consists of Z7 quadkey data. DataFrame contains
        **ALL** Z18 quadkey data in this Z7 quadkey.
    
    """

    bucket = "mapbox-movement-public-sample-dataset"
    
    # Generate range of dates between start and end date in %Y-%m-%d string format
    start = datetime.strptime(start_date, '%Y-%m-%d')
    end = datetime.strptime(end_date, '%Y-%m-%d')
    num_days = int((end - start).days)
    days_range = num_days + 1
    date_range = [(start + timedelta(n)).strftime('%Y-%m-%d') for n in range(days_range)]

    sample_data = []
    for z7_quadkey in z7_quadkey_list:
        for i in range(len(date_range)):
            yr, month, date = date_range[i].split('-')
            url = f"https://{bucket}.s3.amazonaws.com/v0.2/daily-24h/v0.1.2/US/quadkey/total/{yr}/{month}/{date}/data/{z7_quadkey}.csv"
    
            if not os.path.isdir(os.path.join(local_dir, month, date)):
                os.makedirs(os.path.join(local_dir, month, date))

            local_path = os.path.join(local_dir, month, date, f'{z7_quadkey}.csv')

            if verbose:
                print (z7_quadkey, month, date)
                print (f'local_path : {local_path}')

            try:
                res = requests.get(url)
                df = pd.read_csv(StringIO(res.text), sep='|')
                convert_dict = {'agg_day_period': 'datetime64[ns]', 'activity_index_total': float, 'geography': str} 
                df = df.astype(convert_dict)
                # Keep leading zeros and save as string
                df['z18_quadkey'] = df.apply(lambda x: x['geography'].zfill(18), axis=1).astype('str')
                df['z7_quadkey'] = df.apply(lambda x: x['geography'][:6].zfill(7), axis=1).astype('str')
                
                sample_data.append(df)
                df.to_csv(local_path, index=False)

            except requests.exceptions.RequestException as e:
                raise SystemExit(e)

        if verbose:
            print (f'Download completed for {z7_quadkey} over date range {start_date} to {end_date}')

    return pd.concat(sample_data)

We can pass in multiple zoom 7 quadkeys from a single airport, or zoom 7 quadkeys from multiple airports. Taking San Francisco International (SFO) and Denver International (DEN) as examples, we can download all activity data from January 1, 2020 to January 2, 2020:


# Tweak the following set of parameters to include broader time frame or more airports
start_dt_str = "2020-01-01"
end_dt_str = "2020-01-02"

# Find SFO and DEN from the us_airports_df_large DataFrame
sfo = us_airports_df_large[us_airports_df_large["iata_code"] == "SFO"].iloc[0]
den = us_airports_df_large[us_airports_df_large['name'].str.contains("Denver")].iloc[0]

# Add these two airports
airports = [sfo, den]

# Creates a list of z7 quadkeys to download
z7_quadkeys_to_download = []
for airport in airports:
    for z7_quadkey in airport["z7_quadkeys"]:
        z7_quadkeys_to_download.append(z7_quadkey)

# Define a list to append all newly created DataFrames
sample_data_airports = []
for z7 in z7_quadkeys_to_download:
    local_directory = os.path.join(os.getcwd(), f'sample_data_{z7}')
    print ([z7])
    print (local_directory)

    # Run the download script
    sample_data_airports.append(download_sample_data_to_df(start_dt_str, end_dt_str, [z7], local_directory, False))

# Create a DataFrame of all z7 quadkey activity data
sample_data_airports_df = pd.concat(sample_data_airports).sort_values(by=['agg_day_period', 'z18_quadkey'])

This Mapbox Movement data set is aggregated by day, where the dates are specified under the column named agg_day_period. There is a single activity index value (column named activity_index_total) per date per zoom 18 quadkey (column named z18_quadkey). Let’s first filter the DataFrame so that it only includes data from the zoom 7 regions containing SFO's 1 km circle polygon.


# Filter DataFrame to include activity data from SFO z7 quadkeys only. Excludes DEN z7 quadkey data.
sample_data_sfo = []
for z7_quadkey in sfo["z7_quadkeys"]:
    sample_data_sfo.append(sample_data_airports_df[sample_data_airports_df["z7_quadkey"] == z7_quadkey])

# Create a DataFrame for all SFO activity data
sample_data_df_sfo = pd.concat(sample_data_sfo)
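As a quick sanity check of the aggregation described above, we can confirm that each (agg_day_period, z18_quadkey) pair appears only once. A sketch, with a toy DataFrame standing in for the downloaded data:

```python
import pandas as pd

# Toy frame mimicking the daily aggregation: one activity_index_total
# value per date per zoom 18 quadkey (values are illustrative).
toy_df = pd.DataFrame({
    "agg_day_period": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "z18_quadkey": ["023010212330212331", "023010212330212333", "023010212330212331"],
    "activity_index_total": [0.42, 0.13, 0.37],
})

# No (date, quadkey) pair should appear more than once
dup = toy_df.duplicated(subset=["agg_day_period", "z18_quadkey"])
print(dup.any())  # False
```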

Note that there is a lot more zoom 18 quadkey data in this initial DataFrame for SFO than we need. Not all of these zoom 18 quadkeys overlap with our 1 km circle centered at SFO. This is why it is important for us to filter this DataFrame and remove unneeded zoom 18 quadkey entries in the following step:


# Filter the DataFrame to include only entries of Z18 quadkey that overlap with the 1 km circle
z18_sample_data_sfo_df = sample_data_df_sfo[sample_data_df_sfo["z18_quadkey"].isin([z18_qk for z18_qk in sfo["z18_quadkeys"]])]

We now have a DataFrame that includes only entries of relevant zoom 18 quadkeys that overlap with SFO's 1 km circle over the course of two days in January 2020:

Output of the z18_sample_data_sfo_df DataFrame (2020-01-01 to 2020-01-02)

Compared to the two day example we just saw, downloading and filtering a full year's worth of data takes longer to run. In the interest of saving time, we've pre-processed the download step of 2020-01-01 to 2020-12-31 activity data for four airports—Dallas/Fort Worth International Airport (DFW), Denver International (DEN), Dulles International Airport (IAD), and San Francisco International (SFO). This data has been compressed and uploaded as zip files to the public Mapbox Movement sample data bucket on AWS S3. In section III of the Appendix, we will show you how to access these zip files and how these larger zoom 7 quadkey DataFrames can be useful in our deep dive analysis.

We've also pre-processed the filtering step and uploaded activity data from all relevant zoom 18 quadkeys that overlap with each of the four airports' 1 km circles to the same S3 bucket. We can use the previously defined function download_csv_to_df() to download this data. Continuing with SFO as an example:


# Find the San Francisco International Airport row in the DataFrame
sfo = us_airports_df_large[us_airports_df_large["iata_code"] == "SFO"].iloc[0]

# Define the url of SFO's pre-processed CSV file
z18_sample_data_sfo_url = 'https://mapbox-movement-public-sample-dataset.s3.amazonaws.com/airports_dataset/data/z18_sample_data_sfo.csv'

# Download filtered sample data for sfo and read into a DataFrame. Parse agg_day_period as datetime object
z18_sample_data_sfo_df = download_csv_to_df(z18_sample_data_sfo_url, parse_dates=['agg_day_period'])

We also want to get the IATA airport code from our original US large airport DataFrame, so that we can distinguish one airport from another when we’re doing the comparisons later. Finally, we will sort the DataFrame by date.


# Find the total number of z18 quadkeys that overlap with the 1 km radius circle
z18_sample_data_sfo_df['total_num_z18_qks'] = float(len(set(sfo['z18_quadkeys'])))

# Get the IATA airport code
z18_sample_data_sfo_df['iata'] = sfo['iata_code']

# Sort by date
z18_sample_data_sfo_df = z18_sample_data_sfo_df.sort_values(by=['agg_day_period', 'z18_quadkey'])

This is what the resulting DataFrame looks like:

Output of the z18_sample_data_sfo_df DataFrame (2020-01-01 to 2020-12-31)

We can repeat this step for the remaining three airports: Dallas/Fort Worth International Airport (DFW), Denver International (DEN), and Dulles International Airport (IAD). You can find more detailed code in the Jupyter notebook, available on GitHub and AWS S3.
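One compact way to repeat the download is to build the remaining URLs in a loop. This sketch assumes the pre-processed files follow the same z18_sample_data_<iata>.csv naming pattern as SFO's file, which is worth verifying against the bucket before relying on it:

```python
# Sketch: build pre-processed CSV URLs for the other three airports.
# The file naming pattern is an assumption based on SFO's file name;
# verify the object keys in the S3 bucket before use.
base_url = (
    "https://mapbox-movement-public-sample-dataset.s3.amazonaws.com"
    "/airports_dataset/data"
)
urls = {
    code.upper(): f"{base_url}/z18_sample_data_{code}.csv"
    for code in ["dfw", "den", "iad"]
}
for iata, url in urls.items():
    # Each URL can then be passed to download_csv_to_df(url, parse_dates=['agg_day_period'])
    print(iata, url)
```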

Next, we want to generate some interesting statistics for each airport.

I. AGGREGATE THE DAILY ACTIVITY INDEX FOR ALL Z18 QUADKEYS OVERLAPPING THE TARGET AIRPORT

The following code snippet calculates the aggregated activity per day for SFO:


# Calculate the sum of activity per day and construct a DataFrame
sum_data = z18_sample_data_sfo_df.groupby(['agg_day_period', 'total_num_z18_qks', 'iata']).sum()['activity_index_total']

# Rename the column
sfo_ai_stats_df = pd.DataFrame(sum_data).reset_index().rename(columns={'activity_index_total':'sum_ai_daily'})

# Set agg_day_period as the index
sfo_ai_stats_df = sfo_ai_stats_df.set_index('agg_day_period')

Repeat this step for all three other airports, plot and compare the daily sum of activity indices of all four airports. The script that generates the chart below can be found in the Jupyter notebook.
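As a reference, here is a minimal sketch of such a comparison plot with matplotlib. The toy DataFrames below stand in for the real per-airport stats frames (like sfo_ai_stats_df); the notebook's version is more elaborate:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe on a headless server
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy stand-ins for the per-airport stats DataFrames: each has a
# DatetimeIndex and a `sum_ai_daily` column (values are illustrative).
dates = pd.date_range("2020-01-01", periods=5, freq="D")
stats_by_airport = {
    "SFO": pd.DataFrame({"sum_ai_daily": np.linspace(10, 6, 5)}, index=dates),
    "DEN": pd.DataFrame({"sum_ai_daily": np.linspace(12, 8, 5)}, index=dates),
}

# Plot one line per airport on a shared time axis
fig, ax = plt.subplots(figsize=(10, 4))
for iata, stats_df in stats_by_airport.items():
    ax.plot(stats_df.index, stats_df["sum_ai_daily"], label=iata)
ax.set_xlabel("Date")
ax.set_ylabel("Aggregated daily activity index")
ax.legend()
fig.savefig("airport_activity.png")
```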

Chart 2: Aggregated daily activity at San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport from January 1 to December 31, 2020. Source: Mapbox Movement data.

From the chart above, we can see that DFW generally has the highest levels of activity of the four airports, with DEN coming in second. Although activity at all airports decreased after the start of the pandemic, this ranking holds throughout the year.

This analysis comes with caveats. Recall that we estimated the area of each airport to be a circle with a radius of 1 km around a centroid point. This may not be entirely accurate. For example, some airports may be larger than others, and the centroid we used may be closer to the terminals in some airports. There may even be more subtle differences between airports. For example, people may move more across some airports because the terminals are farther from the baggage claims. To properly compare absolute activity levels, we would need to define the area of each airport more precisely (more on that below), as well as potentially take other features of the airports' activity into consideration.

II. COMPARE POST-PANDEMIC CHANGES IN ACTIVITY BETWEEN AIRPORTS

Instead of comparing absolute activity totals, it may be more informative to compare the changes in activity patterns between airports over time. To do so, we can normalize the activity at all airports over the same month of January 2020 (we exclude January 1 to January 3, because activity patterns are usually atypical around the New Year). We can then visualize the way the time series of activity at different airports diverge from this starting value.

This analysis is useful to understand, for example, whether activity fell at some airports more than others after the start of the pandemic, and whether it bounced back more in some airports. This type of analysis is also more resilient against the issues described above—as long as we assume that the airports themselves haven't changed, we don't need to have a precise definition of each airport's area or an understanding of its structure to compare the directionality of activity changes between airports.


from datetime import datetime

# Find the timestamp for Jan 4 to Jan 31, 2020 and generate a list of dates
jan_start_date = datetime.strptime('2020-01-04', "%Y-%m-%d")
jan_end_date = datetime.strptime('2020-01-31', "%Y-%m-%d")
jan_dates = pd.date_range(jan_start_date, jan_end_date, freq='D')

# Compute the aggregated activity from Jan 4 to Jan 31, 2020 for SFO
sfo_sum_ai_Jan_2020 = sfo_ai_stats_df[sfo_ai_stats_df.index.isin(jan_dates)].sum()['sum_ai_daily']
sfo_ai_stats_df["sum_ai_Jan_2020"] = sfo_sum_ai_Jan_2020

# Normalize the daily sum of activity by the average daily activity over Jan 4 to Jan 31, 2020 (28 days)
sfo_ai_stats_df["normalized_sum"] = 28 * sfo_ai_stats_df["sum_ai_daily"] / sfo_sum_ai_Jan_2020

# Inspect the DataFrame
sfo_ai_stats_df

The normalized activity index curves below allow us to compare the activity index at the peak of the pandemic shutdowns on April 12, 2020 to the summer and winter months across all four airports:

Chart 3: Normalized aggregated daily activity at San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport from January 1 to December 31, 2020, comparing the April 12, 2020 trough to Thanksgiving. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

We can see that, in general, all four airports show similar activity trends throughout the year. The above chart confirms the sharp decline in air travel from mid-March through April due to travel restrictions.

And that’s it! We have now replicated the retail recovery analysis with airports instead of supermarkets. This same methodology can be used to analyze any number of industries and problem spaces. If the above chart looks too noisy due to day-to-day fluctuations, we may want to compute a rolling average to smooth out the graphs and more easily visualize macro trends. This brings us to the next statistic.

III. COMPUTE THE 7-DAY ROLLING AVERAGE OF ALL Z18 QUADKEYS OVERLAPPING THE TARGET AIRPORT

This produces a smooth curve that evens out the spikes and fluctuations we observe in the aggregated daily sum of activity indices for each airport.


# Calculate the 7-day rolling average of activity index.
sfo_ai_stats_df["sum_ai_daily_rolling_avg"] = sfo_ai_stats_df["sum_ai_daily"].rolling(window=7).mean()
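To see why a 7-day window smooths out spikes, here is a toy illustration (hypothetical values, not Movement data): a single-day spike of +70 in the raw series shows up as a bump of only +10 in the rolling mean, because the spike is averaged across seven days.

```python
import pandas as pd

# Toy daily series with a one-day spike (hypothetical values)
s = pd.Series([10.0] * 14)
s.iloc[7] = 80.0  # single-day spike

rolling = s.rolling(window=7).mean()

# The raw series jumps by 70 at the spike; the rolling mean rises by only 10,
# since the spike contributes 70/7 = 10 to each window that contains it.
print(s.max() - s.min(), rolling.max() - rolling.min())
```

Note that `rolling(window=7)` produces NaN for the first six days, since a full window is not yet available; pass `min_periods=1` if you prefer partial windows at the start of the series.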

Plot the 7-day rolling average activity indices for all four airports:

Chart 4: 7-day rolling average of aggregated daily activity of San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport from January 1 to December 31, 2020. Source: Mapbox Movement data.

Interestingly, though as expected, the dip in activity over Thanksgiving gets absorbed by the rolling average, so it’s important to choose the right visualization depending on the use case and the insights you are presenting.

We can also compute a 7-day rolling average of the normalized aggregated daily activity indices from Chart 3 above to smooth out the spikes and fluctuations.


# Compute the 7-day rolling average of the normalized daily aggregated activity
sfo_ai_stats_df["norm_sum_rolling_avg"] = sfo_ai_stats_df["normalized_sum"].rolling(window=7).mean()

Plot the 7-day rolling average of normalized aggregated daily sum of activity indices for all four airports:

Chart 5: 7-day rolling average of normalized aggregated daily activity of San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport from January 1 to December 31, 2020. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

The 7-day rolling averages help us more easily observe the macro changes from the peak of the pandemic shutdowns on April 12, 2020 to the summer and winter months across all four airports. After normalization, we can see that DEN now ranks first in activity level, with DFW coming in second. It’s also easy to spot the increase in activity levels at DEN and DFW during the re-opening phase over the summer months, reaching approximately half the activity levels of the pre-pandemic months.

Of course, our analysis need not be limited to a simple rolling average. We can also apply other time series methods to smooth the series; for example, we might normalize by variation across days of the week, or analyze weekday and weekend traffic separately. See section II in the Appendix for additional information.
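As a sketch of the day-of-week idea (a hypothetical adjustment, not code from the notebook), we could divide each day's value by the average level for that day of the week; on a series with a pure weekly cycle this removes the seasonality entirely:

```python
import pandas as pd

# Toy series: weekdays around 100, weekends around 50 (hypothetical values)
dates = pd.date_range("2020-01-06", periods=28, freq="D")  # starts on a Monday
values = [100.0 if d.weekday() < 5 else 50.0 for d in dates]
s = pd.Series(values, index=dates)

# Average level for each day of the week, broadcast back to every row
weekday_means = s.groupby(s.index.weekday).transform("mean")

# Dividing out the weekly seasonality leaves a flat series at 1.0
adjusted = s / weekday_means
print(adjusted.min(), adjusted.max())
```

On real data the adjusted series would not be perfectly flat, but the weekly sawtooth would be suppressed, making longer-term trends easier to read.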

BUT WAIT! AREN'T THESE AIRPORTS MUCH LARGER THAN THE 1 KM CIRCLES?

Now, let’s address the elephant in the room: the fact that we drew circles with a 1 km radius around each airport regardless of its size and passenger traffic. According to the FAA, SFO occupies 5,207 acres (21.07 sq. km), while DEN covers 33,531 acres (52.4 sq. mi; 135.7 sq. km). In terms of passenger traffic, Denver International Airport (DEN) saw a record-setting 64.5 million passengers in 2018, whereas SFO saw more than 57.7 million travelers in the same year. Drawing same-sized circles was a convenient first step to quickly define areas around each airport, but how much of the airports’ true activity patterns were we able to capture?

The quick answer is that it doesn’t matter too much in terms of relative activity, because the normalized daily aggregated activity trends tend to be quite stable. The chart below shows the normalized aggregated daily activity within the precisely drawn boundaries of Denver International Airport (DEN) against various circle sizes and locations.

Chart 6: Normalized aggregated daily activity within precisely drawn boundaries of Denver International Airport against circles at various locations: Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii range from 1 km, 1.5 km, 2 km, 2.5 km, and 4 km. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

MORE AIRPORTS: JFK, LGA, EWR

With the methods described above, we can expand to any airport of interest. The chart below shows the 7-day rolling average of normalized aggregated daily activity at three new airports—John F Kennedy International Airport (JFK), LaGuardia Airport (LGA), and Newark Liberty International Airport (EWR). These three airports are located in close proximity to one another. Activity at all three follows a similar decline after the start of the pandemic, as we saw for the previous four airports. We can see that activity levels at Newark Liberty International Airport and LaGuardia Airport trend upward during the re-opening phase over the summer months, while activity at John F Kennedy International Airport remains relatively low. This is particularly interesting because JFK is directly connected to the subway and offers the most domestic and international flights and schedule options of the three airports; for many New Yorkers, JFK is their go-to airport. The suppressed activity levels we see at JFK may be attributed to the unprecedented decline in international travel due to the travel ban.

Chart 7: 7-day rolling average of normalized aggregated daily activity at John F Kennedy International Airport (JFK), LaGuardia Airport (LGA), Newark Liberty International Airport (EWR), from January 1 to December 31, 2020. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

CONCLUSION—AND MORE POSSIBILITIES

In this blog post, we’ve explored the tools and methodologies for accessing and getting the most use out of Mapbox Movement data. We also did a deep dive into how reliably the normalized daily aggregated activity index of a small area (a 1 km circle) represents the overall movement patterns of an entire airport. With the tools and datasets introduced in this post, we can observe the macro patterns and draw interesting insights to better understand one of the industries most severely impacted by the pandemic. Here are a few other interesting ideas to explore related to the airport dataset:

  1. Compare activity trends for small and medium-sized airports against large airports in the US. We may find some interesting spatial and temporal trends at these smaller airports around Thanksgiving 2020.
  2. Compare activity trends for airports in other countries around the world against US airports. Mapbox Movement has coverage globally.

These same tools and methodologies can also be applied to any other geographic locations, such as retail locations, parks, and beaches. If you’re interested in trying out this tutorial and/or getting your own mobility insights, make sure to check out the Mapbox Movement Data page to download the data, grab the Jupyter notebook off GitHub or AWS S3, and get started today!

APPENDIX

I. PROJECTION SYSTEMS

Projections are mathematical transformations that take spherical coordinates (latitude and longitude) and transform them to an XY (planar) coordinate system. However, since we are transforming a coordinate system used on a curved surface to one used on a flat surface, there’s no way to convert one to the other without some distortion. Some projection systems preserve shape and area, while others preserve distance and direction, so choosing a projection system is a nontrivial task. A commonly used default is Universal Transverse Mercator (UTM), which divides the Earth into 60 predefined zones, each 6 degrees wide. Within a zone, the UTM projection has very little distortion; however, the farther away from the center of the zone, the greater the distortion of areas and distances. In this example, we are performing distance calculations on airports across the entire US, so the azimuthal equidistant projection (AEQD) is a more appropriate choice: all points on the map are at proportionally correct distances from the center point, and all points lie in the correct direction from the center point.
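To make the distance-preserving property concrete, here is a minimal sketch of the AEQD forward projection on a sphere (a simplification; production libraries such as pyproj use an ellipsoidal model). The planar distance from the projection center always equals the great-circle distance:

```python
from math import radians, sin, cos, acos, hypot

R = 6371000.0  # mean Earth radius in meters (spherical approximation)

def aeqd_forward(lat0, lon0, lat, lon):
    """Project (lat, lon) to planar (x, y) meters, centered at (lat0, lon0)."""
    phi0, phi = radians(lat0), radians(lat)
    dlam = radians(lon - lon0)
    # Angular distance c from the center to the point (spherical law of cosines)
    c = acos(sin(phi0) * sin(phi) + cos(phi0) * cos(phi) * cos(dlam))
    k = c / sin(c) if c != 0 else 1.0
    x = R * k * cos(phi) * sin(dlam)
    y = R * k * (cos(phi0) * sin(phi) - sin(phi0) * cos(phi) * cos(dlam))
    return x, y

# Center the projection on DEN; hypot(x, y) equals the great-circle
# distance from the center to the projected point, by construction.
lat0, lon0 = 39.861698150635, -104.672996521
x, y = aeqd_forward(lat0, lon0, 39.95, -104.60)
print(round(hypot(x, y)))  # planar distance in meters
```

This is why AEQD is convenient for buffering: a 1 km circle drawn in AEQD coordinates centered on the airport really is 1 km from the center in every direction.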

II. COMPUTE THE ADJUSTED DAILY AVERAGE ACTIVITY INDEX FOR ALL Z18 QUADKEYS

During our earlier introduction to the Mapbox Movement product, we mentioned that the activity index is calculated by aggregating anonymized location data from mobile devices. This anonymization process is implemented to protect Mapbox users and is applied on a daily basis. During this process, a small amount of random noise is intentionally added to total activity counts, and geographic areas with counts below a minimum threshold are dropped. This means that certain zoom 18 quadkeys may not have enough activity counts to make it past our privacy thresholds on a given day; these quadkeys will not be included in the dataset on that day.

Because of this, we need to pay special attention when computing medians or averages. Simply applying the built-in mean() and median() methods to the activity counts could lead to inaccurate results, because the denominator, which is based on the number of quadkeys that do have enough activity counts, will also fluctuate on a daily basis. Instead, the correct way to compute the daily average is to divide by the total number of Z18 quadkeys that overlap the circle with 1 km radius. This is the correct denominator because it also accounts for the Z18 quadkeys that did not make it past our privacy thresholds on a given day. Take San Francisco International Airport (SFO) for example:


# Find SFO from the us_airports_df_large DataFrame
sfo = us_airports_df_large[us_airports_df_large["iata_code"] == "SFO"].iloc[0]

# Get the total number of z18 quadkeys that overlap the 1 km circle polygon
total_num_z18_qks = float(len(set(sfo['z18_quadkeys'])))

# Use the sum of activity per day to calculate the adjusted daily average activity index. Add it as a new column to the stats DataFrame
sfo_ai_stats_df['adj_mean_ai_daily'] = sfo_ai_stats_df['sum_ai_daily'] / total_num_z18_qks
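A toy example (hypothetical counts, not real data) shows why the naive mean misleads when a quadkey drops below the privacy threshold:

```python
# Four Z18 quadkeys overlap the circle; on day 2 one quadkey is dropped
# because it fell below the privacy threshold (hypothetical counts).
day1_counts = [4.0, 3.0, 2.0, 1.0]   # all four quadkeys reported
day2_counts = [4.0, 3.0, 2.0]        # one quadkey missing from the data

total_quadkeys = 4

# Naive mean divides by however many quadkeys happened to report that day
naive_day1 = sum(day1_counts) / len(day1_counts)   # 2.5
naive_day2 = sum(day2_counts) / len(day2_counts)   # 3.0 -- looks like activity rose

# Adjusted mean always divides by the total number of overlapping quadkeys
adj_day1 = sum(day1_counts) / total_quadkeys       # 2.5
adj_day2 = sum(day2_counts) / total_quadkeys       # 2.25 -- activity actually fell

print(naive_day2, adj_day2)
```

The naive mean reports an increase between the two days even though total observed activity dropped; the adjusted mean, with its fixed denominator, moves in the same direction as the underlying activity.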

Plot the adjusted daily average activity indices of all four airports:

Chart 8: Adjusted average aggregated daily activity at San Francisco International Airport, Denver International Airport, Dallas/Fort Worth International Airport, and Dulles International Airport from January 1 to December 31, 2020. Source: Mapbox Movement data.

III. DEEP DIVE INTO APPROXIMATE VS. PRECISE AIRPORT BOUNDARIES

This section explores how the daily aggregated activity index and normalized curve fluctuate depending on how we define an airport’s boundaries. Can we trust that the normalized daily aggregated activity index of a small area (a 1 km circle) adequately captures the overall movement patterns of an entire airport? We will compare the statistics of various boundary sizes, including one derived from the precisely drawn boundaries of an airport.

Similar to what we did in the original airport analysis, we will first need to draw new boundaries. We can no longer use the filtered DataFrames, e.g. the z18_sample_data_sfo_df, because these DataFrames consist only of data from zoom 18 quadkeys that overlap each respective airport’s original 1 km circle polygon. We will need activity data from a larger set of zoom 18 quadkeys. Recall that we've compressed 2020-01-01 to 2020-12-31 activity data for four airports: San Francisco International Airport (SFO), Dallas/Fort Worth International Airport (DFW), Denver International Airport (DEN), and Dulles International Airport (IAD). Each of these zip files consists of activity data from all zoom 18 quadkeys within a single zoom 7 quadkey.

The function download_zipped_z7_sample_data_to_df() in section III of the Appendix lets us download the zip file for a given airport and create a Pandas DataFrame to hold its Movement sample data. We will use Denver International Airport (DEN) as an example throughout this section. First, download DEN’s 2020 Movement sample data:


# Load DEN Z7 quadkey sample data. Note that we are loading the unfiltered DataFrame, which includes available data from all Z18 quadkeys
airports_sample_data_path = os.path.join(os.getcwd(), 'airports_sample_data')
sample_data_df_den = download_zipped_z7_sample_data_to_df(airports_sample_data_path)

We also need a method that helps us generate a suite of new circle Polygon objects based on two factors: how much we want to shift the original center coordinates, and how large we want to make the circles. The function generate_new_circles() in the Jupyter notebook uniformly shifts the center coordinates by the input range of decimal degrees along both axes (up, down, left, and right) and creates circle Polygon objects with the input range of radii around these new centers.

The snippet below produces 27 new circles of various center locations and radii for DEN:


# Denver International Airport
den = us_airports_df_large[us_airports_df_large['name'].str.contains("Denver")].iloc[0]

# set range of degrees to shift the new circle centers
center_r = np.round(np.arange(-0.01, 0.015, 0.005), 3)

# set range of new circle radii
size_r = [1000, 1500, 2000]

# Generate new circles for DEN
den_new_circles, den_new_circle_map = generate_new_circles(den, center_r, size_r)
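The count of 27 follows from the combinatorics: four non-zero shifts applied along each of the two axes, plus the unshifted original, give nine centers, and three radii per center give 27 circles. Here is a minimal sketch of how generate_new_circles() might enumerate these combinations (the actual function in the notebook also builds the circle Polygon objects):

```python
import numpy as np

# Same shift range and radii as in the snippet above
center_r = np.round(np.arange(-0.01, 0.015, 0.005), 3)  # [-0.01, -0.005, 0.0, 0.005, 0.01]
size_r = [1000, 1500, 2000]

lon0, lat0 = -104.672996521, 39.861698150635  # DEN's original center

# Shift the center along each axis separately (plus the unshifted original)
centers = {(lon0, lat0)}
for d in center_r:
    if d != 0.0:
        centers.add((lon0 + d, lat0))  # horizontal shift
        centers.add((lon0, lat0 + d))  # vertical shift

# One circle per (center, radius) combination
circles = [(c, r) for c in centers for r in size_r]
print(len(centers), len(circles))  # 9 centers x 3 radii = 27 circles
```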

This is what these circles look like on a kepler.gl map:

Map of all new circles drawn for Denver International Airport. Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii range from 1 km, 1.5 km, 2 km. The center circle in blue is the original circle with 1 km radius.

Similar to how we generated statistics for each of the four airports above, we repeat the same process for each of these new circles using the generate_daily_stats_circles() function, and plot the results with the plot_daily_stats_curves() function (you can find more detailed code in the Appendix section of the Jupyter notebook, available on GitHub and AWS S3). Observe how the daily aggregated activity index and the normalized daily aggregated activity index change.

We can see that some circles have higher daily aggregated activity than others. The circle with consistently higher daily aggregated activity is the one shifted 0.01 degrees south of the original center coordinates (-104.672996521, 39.861698150635), with a radius of 2,000 meters (labeled ‘y_0.01_2000’).

Chart 9: Aggregated daily activity at various Denver International Airport locations: Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii range from 1 km, 1.5 km to 2 km. Source: Mapbox Movement data.

Interestingly, the normalized curves all follow a similar pattern regardless of center location or amount of area covered (the plot legend shows the normalized curve for the original center coordinates with various radii).

Chart 10: Normalized aggregated daily activity at various Denver International Airport locations: Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii sizes range from 1 km, 1.5 km, 2 km. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

Next, let's draw the precise boundary for DEN and see how its daily aggregated activity and normalized curve differ from the ones produced by various circles:

Precise boundaries of Denver International Airport (DEN)

From March to December 2020, the normalized curve for the precise boundary (labeled prec_normalized_sum_DEN_precise in the plot below) consistently has the lowest values compared to all the other normalized curves produced by the various circles. This may be because the precise boundary covers a much larger area of DEN and therefore takes into account the lower traffic volumes near the international terminal. Travel restrictions to and from the majority of European countries went into effect on March 13, 2020. Even though these normalized curves share similar patterns, we can still see the travel ban’s impact on the overall activity at DEN in the plot below:

Chart 11: Normalized aggregated daily activity within precisely drawn boundaries of Denver International Airport against activity within approximate boundaries (different sizes of circles at various locations: Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii range from 1 km, 1.5 km, 2 km). All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

Let’s see what happens when we increase the radii to 4 kilometers:

Map of all new circles drawn for Denver International Airport. Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii sizes range from 1 km, 1.5 km, 2 km, 4 km. The center circle in beige is the original circle with 1 km radius.

The normalized daily sum for the 4,000-meter circle nearly overlaps with the precise boundary’s normalized daily sum, and its pattern matches the normalized curves of the other circles.

Chart 12: Normalized aggregated daily activity within precisely drawn boundaries of Denver International Airport against circles at various locations: Circle centers shift -0.01, -0.005, 0.005, 0.01 degrees from original center coordinates. Radii range from 1 km, 1.5 km, 2 km, 2.5 km, and 4 km. All series normalized to the aggregated activity from January 4 to January 31, 2020. Source: Mapbox Movement data.

From these experiments, we can draw two main conclusions:

  1. We can trust that the overall movement patterns for the airports as a whole are well represented by these normalized daily aggregated activity indices. The various plots above show that even though the magnitude of the daily aggregated activity index may fluctuate, the overall trends in the normalized daily aggregated activity index do not depend on the specific choice of center coordinates and radius. They are fairly stable.
  2. As we see in the line charts, there may be some noise in individual activity levels. It’s important to remember that we are looking for consistent macro patterns and should avoid over-interpreting individual data points. For example, the different normalized curves differ by up to 0.1 in activity score in October, so you probably shouldn’t consider a change of that size to be particularly meaningful in your analysis unless you do a confidence-level analysis, or unless you see that the pattern holds cleanly and consistently. Another example of this is the analysis we did in the retail recovery blog post, where we compared normalized activity indices across more than 5,000 retail locations. Costco’s overall movement patterns looked very different from Macy’s, while AMC’s index greatly overlapped with Macy’s. This means the difference between the first pair gives us a stronger signal about how people’s activity patterns changed than the difference between the second pair.
