Last year, I explained how we use dynamic shape simplification and partitioning algorithms to render huge polygon and polyline datasets in Mapbox GL JS. This approach unfortunately doesn’t work for individual points: so how do we display hundreds of thousands of points in a meaningful way on a map?

Mapbox GL JS recently got a new feature to address this — point clustering. Let’s see how it can handle loading 40MB of GeoJSON data with 400,000 points in the browser:

By adjusting the style of the clusters, we can also turn this into a heatmap:

After a few seconds loading the data, you can browse it smoothly at 60fps. To enable this kind of performance, I built a new JavaScript clustering library called Supercluster, which now powers Mapbox GL JS. Next, we’re porting Supercluster to C++ to bring clustering to our iOS and Android SDKs.

Let’s find out how Supercluster works under the hood.

Hierarchical greedy clustering

A simple and efficient way to group points into clusters for a specific zoom is called greedy clustering. It works like this:

  1. Start with any point from the dataset.
  2. Find all points within a certain radius around that point.
  3. Form a new cluster with the nearby points.
  4. Choose a new point that isn’t part of a cluster, and repeat until we have visited all the points.

This would be expensive to do for every zoom level of the map — for example, if zoom levels range from 0 to 18, we would have to process the whole dataset 19 times, and clustering would get too slow on lower zoom levels because each cluster would need to accommodate an exponentially increasing number of points.

We avoid this problem by reusing calculations. After we cluster the points on zoom level 18, we can group the resulting clusters (with weighted centroids) into new z17 clusters. Then we can use z17 clusters to form z16 clusters, and so on. Since the number of points we need to process with each such step decreases exponentially, we can cluster points for all zoom levels really fast.

This approach is called hierarchical greedy clustering, and was popularized by Dave Leaver with his fantastic Leaflet.markercluster plugin. Unlike more sophisticated clustering algorithms, it can be fast enough to handle millions of points in the browser, and it’s good enough to use for browsing point datasets on an interactive map.

Indexing points and clusters

Implementation of this clustering approach in an interactive map has two potentially expensive operations:

  1. Finding all points within a certain radius.
  2. Finding clusters on the current screen.

The simplest way to do a radius query is to loop through all the points and save those that are close enough to our query point. But since we need to do a lot of such queries – one for each potential cluster – this would be too slow. Similarly, we don’t want to loop through all the points for screen queries, which are necessary every time we drag or zoom the map.

To do both efficiently, we must use a spatial index — processing points into a special data structure once and then using it to do any number of later queries almost instantly.

Spatial indexing is an indispensable building block in many algorithms, so a lot of research went into building the fastest JavaScript implementation, RBush, which asks for a separate blog post.

Each clustering step for a specific zoom level requires indexing points from the earlier step. But after the clustering process, having an index for each zoom allows us to instantly query clusters for any map view.

Clustering performance

Here’s a breakdown of times for each clustering step for the 400,000 points dataset we’ve seen in the video:

399601 points prepared in 123ms
z16: indexed in 516ms   clustered in 156ms   46805 clusters
z15: indexed in 53.4ms  clustered in 40.8ms  20310 clusters
z14: indexed in 12.4ms  clustered in 17.2ms  10632 clusters
z13: indexed in 7.9ms   clustered in 12.9ms  6578 clusters
z12: indexed in 3.4ms   clustered in 6.9ms   4514 clusters
z11: indexed in 2.4ms   clustered in 3.4ms   3252 clusters
z10: indexed in 1.4ms   clustered in 2.3ms   2329 clusters
 z9: indexed in 1ms     clustered in 1.5ms   1608 clusters
 z8: indexed in 0.6ms   clustered in 1.5ms   1083 clusters
 z7: indexed in 0.4ms   clustered in 0.7ms   721 clusters
 z6: indexed in 0.2ms   clustered in 0.4ms   421 clusters
 z5: indexed in 0.1ms   clustered in 0.2ms   202 clusters
 z4: indexed in 0.1ms   clustered in 0.1ms   82 clusters
 z3: indexed in 0ms     clustered in 0ms     34 clusters
 z2: indexed in 0ms     clustered in 0ms     14 clusters
 z1: indexed in 0ms     clustered in 0ms     7 clusters
 z0: indexed in 0ms     clustered in 0ms     3 clusters
total time: 972ms

Supercluster spends less than 1 second total processing this dataset. Since processing happens off the main thread (in a Web Worker), it doesn’t block map rendering. After the data is clustered, any screen query happens instantly (less than 1ms).

Using the library

Supercluster API is minimal:

// cluster GeoJSON points
var index = supercluster({radius: 40, maxZoom: 16}).load(geojson.features);

// get GeoJSON clusters given a bounding box and zoom
var clusters = index.getClusters([-180, -85, 180, 85], 2);

// get a JSON vector tile in the same format as GeoJSON-VT
var tile = index.getTile(7, 523, 125);

Because Supercluster works so well for Mapbox GL JS, we released it as a standalone library so that other software can take advantage of its fast algorithms. For example, here’s a video of it being used together with Leaflet (which powers Mapbox.js) to browse 6 million points loaded in the browser (source code):

You can also use it to cluster points on the server with Node.js, and convert the resulting clusters into Mapbox vector tiles using the vt-pbf module by Anand Thakker.

We’re looking forward to seeing more amazing map apps with clustering from you! And don’t hesitate to read through the source code — it’s around 200 lines of code. Stay tuned for clustering support in our mobile SDKs, and hit me up on Twitter if you have any questions or comments.