I’ve been tracking geotagged tweets from Twitter’s public API for the last three and a half years.
There are about 10 million public geotagged tweets every day, which is about 120 per second,
up from about 3 million a day when I first started watching. The accumulated history adds up to nearly
three terabytes of compressed JSON and is growing by four gigabytes a day.
And here is what those 6,341,973,478 tweets look like on a map,
at any scale you want.
I’ve open sourced the tools I used to manipulate the data
and did all the design work in Mapbox Studio Classic. Here’s how you can make one like it yourself.
You can follow Twitter’s stream of geotagged public tweets
using the “statuses/filter” API
to request tweets from a particular bounding box or the whole world.
Before you can connect, you have to register a Twitter API key
and authenticate using it.
Last year I couldn’t find a simple library to generate the OAuth header Twitter requires,
so I wrote this one. Once you have authenticated and connected to the
filter API, you receive a steady stream of tweets in JSON format. They include a lot more metadata than you
necessarily need to make a dot map, so I’ve been using
this program to parse the JSON
and pull out just each tweet’s username, date, time, location, client, and text.
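His actual parser is the program linked above; a rough sketch of the same extraction, assuming Twitter’s classic v1.1 tweet JSON layout, might look like this:

```python
import json

def extract_fields(raw):
    """Pull out just the fields a dot map needs from one tweet's JSON.
    Field names follow Twitter's classic v1.1 tweet format."""
    tweet = json.loads(raw)
    coords = tweet.get("coordinates")
    if coords is None:                    # no exact point location; skip
        return None
    lon, lat = coords["coordinates"]      # GeoJSON order: longitude first
    return {
        "user": tweet["user"]["screen_name"],
        "when": tweet["created_at"],
        "lat": lat,
        "lon": lon,
        "client": tweet["source"],
        "text": tweet["text"],
    }

sample = json.dumps({
    "created_at": "Mon Jun 01 12:00:00 +0000 2015",
    "user": {"screen_name": "example"},
    "coordinates": {"type": "Point", "coordinates": [-122.42, 37.77]},
    "source": "Example Client",
    "text": "hello from the ferry building",
})
row = extract_fields(sample)
```

Tweets without an exact point location come through with `coordinates` set to null, so the sketch simply skips those.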
Filtering the data
Even though there are six billion tweets to map, only 9% of them are ultimately visible as unique dots.
The others are filtered out as duplicate or near-duplicate locations.
For instance, every Foursquare check-in to a particular venue is tagged with the same location,
and it doesn’t help the map to draw that same dot over and over.
Showing the same person tweeting many times within a few hundred feet also makes the map
very splotchy, so I filter out those near-duplicates too.
In addition, if you plot all the tweet locations without any filtering, tweets from iPhones
show severe banding from either the latitude or the longitude being snapped to a grid.
The bands must be the result of fuzzed location data to avoid revealing people’s exact locations,
but they are very visually obtrusive. I eliminate most of the banding by letting each unique latitude
and longitude only appear once on the map, and dropping any additional tweets that try to reuse one.
Here is the code that I use
to deband and deduplicate tweets.
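The link above is the real filter; a minimal sketch of the two rules — exact-coordinate uniqueness for debanding, and a per-user near-duplicate radius for de-splotching — could look like this (the 0.002-degree radius is an illustrative guess, not his threshold):

```python
seen_lats = set()
seen_lons = set()
last_kept = {}     # each user's most recent kept location

def keep(user, lat, lon, radius=0.002):   # ~0.002 deg, a few hundred feet
    # Deband: let each exact latitude and each exact longitude appear only once.
    if lat in seen_lats or lon in seen_lons:
        return False
    # De-splotch: drop repeat tweets by the same user within the radius.
    prev = last_kept.get(user)
    if prev and abs(lat - prev[0]) < radius and abs(lon - prev[1]) < radius:
        return False
    seen_lats.add(lat)
    seen_lons.add(lon)
    last_kept[user] = (lat, lon)
    return True
```

Rejected tweets never enter the seen sets, so a coordinate “used up” by one tweet blocks reuse, but a blocked tweet doesn’t itself block anything.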
Banding and splotching in unfiltered tweets
I thought there must be a bug in my debanding code, because if you
zoom in on London, there
is a very visible blank stripe at the Prime Meridian where almost no tweets appear.
But the same stripe also shows up in the unfiltered tweets, so it must be Twitter
that is filtering them out.
Missing data at the Prime Meridian
Making vector tiles
The challenge of making dot maps is to keep all the detail when you zoom in deeply, while
unobtrusively dropping dots as you zoom out so that the low zoom levels are not overwhelmingly dense.
I’ve been working on a new tool called Tippecanoe
for making vector tiles from large data sets whose features don’t have any inherent scale ranking.
You give it a file or stream of GeoJSON input, and it gives you back a vector mbtiles file
to show what your data looks like at any scale.
In the case of point features, it drops exponentially more dots at each lower zoom level,
randomly chosen but consistent from one zoom to the next, so that by the time you get to zoom level 0,
where the whole world is a single map tile, there are only 1586 dots remaining from the 590 million
that are spread across the 4.5 million tiles at zoom level 14.
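Tippecanoe’s actual selection logic lives in its source; the “randomly chosen but consistent” behavior can be imitated by hashing each point to a stable value and comparing it against a per-zoom threshold, so the points kept at each lower zoom are a nested subset of the level above. A sketch, not Tippecanoe’s code:

```python
import hashlib

def keep_at_zoom(lon, lat, zoom, rate=0.4, maxzoom=14):
    """Keep a point at this zoom if its stable hash clears a per-zoom
    threshold; thresholds nest, so lower zooms keep subsets of higher ones."""
    key = f"{lon:.6f},{lat:.6f}".encode()
    h = int.from_bytes(hashlib.md5(key).digest()[:8], "big") / 2**64
    return h < rate ** (maxzoom - zoom)   # zoom 14 keeps every point

points = [(i * 0.001, -i * 0.002) for i in range(10000)]
kept = [sum(keep_at_zoom(x, y, z) for x, y in points) for z in range(15)]
```

Because the hash depends only on the point, the same dots survive from one run to the next, and roughly 40% of each level’s dots survive one level down.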
At zoom level 14 and below, the dots are all the same size, because Tippecanoe has already
reduced the density at each zoom level to 40% of the level above it. The style is
responsible for making the dots larger at each level you zoom in beyond 14:
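The real rule is written as a style in Mapbox Studio Classic; the scaling it describes amounts to something like this sketch (the base radius is an arbitrary illustration):

```python
def dot_radius(zoom, base=0.5, maxzoom=14):
    """Constant dot size through zoom 14; beyond that the diameter grows
    by sqrt(2.5) ~= 1.58 per level, so dot area grows 2.5x per level."""
    growth = 2.5 ** 0.5
    if zoom <= maxzoom:
        return base
    return base * growth ** (zoom - maxzoom)
```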
The mysterious multiplier of 1.58 by which the dot diameter increases with each level
is the square root of 2.5, the reciprocal of the 40% of dots that survive at each lower zoom level,
so the area of each dot grows by a factor of 2.5 with each zoom level.
I still don’t know why 2.5 is the appropriate rate, but many
data sets, including population density, seem to fall off at about this same rate.
You can use a different number if something else looks better for your data.
The color of the dots is applied indirectly by letting their alpha channel accumulate
as dots overlap,
and then using the colorize-alpha image filter to apply colors to halves of the alpha range:
Alpha blending only gives limited control over the brightness ramp, but an opacity of 0.2
reaches 50% brightness with 3 overlapping dots and 97% brightness with 16 dots,
which works out pretty well for the density of tweets.
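Those figures follow from ordinary over-compositing: n stacked dots at opacity a reach a combined alpha of 1 − (1 − a)^n, which you can check:

```python
def stacked_alpha(n, a=0.2):
    """Combined coverage of n over-composited dots, each at opacity a."""
    return 1 - (1 - a) ** n

half = stacked_alpha(3)    # about 49% brightness from 3 dots
full = stacked_alpha(16)   # about 97% brightness from 16 dots
```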
The image filter assigns the bottom half of the alpha range to go from transparent to green
and the top half from green to white, so that the densest areas get an extra glow.
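colorize-alpha itself is a Mapbox Studio Classic image filter; the two-segment ramp it applies here can be approximated in a few lines (a reconstruction of the effect, not the filter’s implementation):

```python
def colorize(alpha):
    """Map accumulated alpha to an RGB color over black: the bottom half
    of the range ramps transparent (black) -> green, the top half
    green -> white, so the densest spots glow."""
    green, white = (0, 255, 0), (255, 255, 255)
    if alpha <= 0.5:
        t = alpha * 2                  # 0..1 across the bottom half
        return tuple(round(c * t) for c in green)
    t = (alpha - 0.5) * 2              # 0..1 across the top half
    return tuple(round(g + (w - g) * t) for g, w in zip(green, white))
```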
The green in the middle is only half-opaque, but
RGB green is inherently bright enough that it still looks reasonably clear. It would be
hard to see if it were blue instead.
Finally, underneath the data layer, for context, sit a desaturated satellite image from Mapbox Satellite and street names from Mapbox Streets. All the layers are rendered together from a single style sheet.