One of the most challenging geographic data sets I have ever worked with is the New York taxi trip origins and destinations that Chris Whong obtained through the New York Freedom of Information Law. It is huge (375 million points), tightly concentrated (nearly all within a few miles of Manhattan) and varied (with some streets only visited once and others more than a million times).

My previous map of this data was a set of static bitmap tiles rendered with Datamaps, where I could handle the wide range of densities by drawing all the dots, accumulating brightness linearly, and then taking the cube root of the accumulation to condense that wide range into the available range of pixel values. That’s hard to do with vector tiles because tiles that contain more than about 200,000 points are too big and too slow to render on demand. So high-density data has meant giving up the resolution-independence and scalability of vector tiles.

You no longer need to make that tradeoff because Tippecanoe can now selectively drop points in the densest areas to moderate their size and visual impact while retaining full detail in sparser areas. The vector tiles for the map below were made with

tippecanoe -z19 -d11 -g3 -o nyctaxi.mbtiles

for a base zoom level of 19, a tile resolution of 2048 (2^11) at that zoom, and a gamma reduction of 3, which, at each zoom level, reduces the density in areas where points are less than one pixel apart to the cube root of their original density.

How it works

Reducing density by counting points within bins doesn’t look very good, and k-means clusters or voronoi diagrams are too slow to construct for such a large number of points. What worked well was to compute the z-order index of each point, the same index that is used to divide maps into tiles, and manipulate the density along that linearization of the point locations instead of trying to do it in space. At zoom level n, the indices for points that are about a pixel apart on screen will tend to be about 2^(64-2*(n+8)) apart, and the places where this is not true, where indices jump around at tile boundaries, will average out over a large number of points.

Even along a line, you don’t have a continuous density function, you have a series of discrete points, but instead of thinking about the density of the points, you can think about the sparseness of the gaps between successive points. The algorithm goes through the points in index order, and at each point looks at the size of the gap between its index and the previous point’s index. If the difference is larger than “a pixel,” as defined above, the gap is sparse enough, so it copies that point into the tile and goes on to the next one. If the gap is too small, it needs to create a gap whose size (relative to the pixel size) is the cube of the gap that it found. So it skips over the current point and over as many more as it needs to until it gets to one that is far enough from the previous point that was added to the tile. With a sufficiently large gap created, it copies that point into the tile and goes on to check the size of the next gap.

The cube root gamma for the taxi map has nothing to do with the eye’s cube root perception of lightness, which monitor gamma compensates for. It just looked about right and reduced the tile density enough that it would work, cutting the million points in the zoom level 19 tile at Penn Station to 150,000. With data whose densities are less extreme, something closer to a square root seems more appropriate. Automatically choosing the optimal gamma for a given data set is a research project for the future.