Last year, I explained how we use dynamic shape simplification and partitioning algorithms to render
huge polygon and polyline datasets in Mapbox GL JS.
This approach unfortunately doesn’t work for individual points: so how do we display hundreds of thousands of points in a meaningful way on a map?
Mapbox GL JS recently got a new feature to address this —
Let’s see how it can handle loading 40MB of GeoJSON data with 400,000 points in the browser:
After a few seconds loading the data, you can browse it smoothly at 60fps.
Supercluster, which now powers Mapbox GL JS.
Next, we’re porting Supercluster to C++ to bring clustering
to our iOS
and Android SDKs.
Let’s find out how Supercluster works under the hood.
Hierarchical greedy clustering
A simple and efficient way to group points into clusters for a specific zoom
is called greedy clustering. It works like this:
Start with any point from the dataset.
Find all points within a certain radius around that point.
Form a new cluster with the nearby points.
Choose a new point that isn’t part of a cluster, and repeat until we have visited all the points.
This would be expensive to do for every zoom level of the map —
for example, if zoom levels range from 0 to 18, we would have to process the whole dataset 19 times,
and clustering would get too slow on lower zoom levels
because each cluster would need to accommodate an exponentially increasing number of points.
We avoid this problem by reusing calculations. After we cluster the points on zoom level 18,
we can group the resulting clusters (with weighted centroids) into new z17 clusters.
Then we can use z17 clusters to form z16 clusters, and so on.
Since the number of points we need to process with each such step decreases exponentially,
we can cluster points for all zoom levels really fast.
This approach is called hierarchical greedy clustering,
and was popularized by Dave Leaver
with his fantastic Leaflet.markercluster plugin.
Unlike more sophisticated clustering algorithms,
it can be fast enough to handle millions of points in the browser,
and it’s good enough to use for browsing point datasets on an interactive map.
Indexing points and clusters
Implementation of this clustering approach in an interactive map
has two potentially expensive operations:
Finding all points within a certain radius.
Finding clusters on the current screen.
The simplest way to do a radius query is to loop through all the points
and save those that are close enough to our query point.
But since we need to do a lot of such queries – one for each potential cluster –
this would be too slow.
Similarly, we don’t want to loop through all the points for screen queries,
which are necessary every time we drag or zoom the map.
To do both efficiently, we must use
a spatial index —
processing points into a special data structure once
and then using it to do any number of later queries almost instantly.
Spatial indexing is an indispensable building block in many algorithms,
RBush, which asks for a separate blog post.
Each clustering step for a specific zoom level requires indexing points from the earlier step.
But after the clustering process, having an index for each zoom
allows us to instantly query clusters for any map view.
Here’s a breakdown of times for each clustering step
for the 400,000 points dataset we’ve seen in the video:
Supercluster spends less than 1 second total processing this dataset.
Since processing happens off the main thread
(in a Web Worker),
it doesn’t block map rendering.
After the data is clustered, any screen query happens instantly (less than 1ms).
Using the library
Supercluster API is minimal:
Because Supercluster works so well for Mapbox GL JS,
we released it as a standalone library so that other software can take advantage of its fast algorithms.
For example, here’s a video of it being used together with Leaflet
(which powers Mapbox.js)
to browse 6 million points loaded in the browser
You can also use it to cluster points on the server with Node.js,
and convert the resulting clusters into Mapbox vector tiles
using the vt-pbf module
by Anand Thakker.
We’re looking forward to seeing more amazing map apps with clustering from you!
And don’t hesitate to read through
the source code —
it’s around 200 lines of code.
Stay tuned for clustering support in our mobile SDKs,
and hit me up on Twitter if you have any questions or comments.