We’re building a new data processing pipeline for analyzing GPS probe data streams. This infrastructure will let us improve the map in new ways and at huge scale. Here’s the first look at how it works.

Streaming GPS tracks are overlaid against our streets data, allowing us to identify where there is activity, like people running or driving, but no trail or road data in our map. When we find a place like this, we queue it up for our data team to look at in to-fix, our open source micro-tasking tool. Once we verify that a path or street is missing, we add it to OpenStreetMap and our map is updated in under 10 minutes.
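To make the overlay idea concrete, here is a minimal sketch, not our production code. It assumes both the probe traces and the street geometry have already been reduced to lon/lat sample points, and it snaps each point to a web-mercator grid cell using the standard slippy-tile formula. The function names `tile_cell` and `unmapped_cells` and the zoom level are hypothetical choices for illustration.

```python
import math

def tile_cell(lon, lat, zoom):
    """Return the (x, y) web-mercator grid cell containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so points on the far edge of the world still land in a valid cell.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)

def unmapped_cells(probe_points, street_points, zoom=18):
    """Cells that contain probe activity but no street sample (assumed inputs)."""
    street_cells = {tile_cell(lon, lat, zoom) for lon, lat in street_points}
    probe_cells = {tile_cell(lon, lat, zoom) for lon, lat in probe_points}
    return probe_cells - street_cells

# Usage: one probe point sits on a mapped street, the other does not.
probes = [(0.0, 0.0), (10.0, 10.0)]
streets = [(0.0, 0.0)]
candidates = unmapped_cells(probes, streets, zoom=10)
```

A set difference like this is the simplest possible comparison. A real pipeline would also need to allow for GPS noise, for example by buffering the street cells before subtracting them.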

We are designing this pipeline to run across massive activity streams that we anonymize, aggregate, and clean. MapReduce-style jobs then use Turf to process the probe data in parallel against Mapbox Vector Tiles.
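The tile-by-tile map and reduce steps can be sketched roughly as follows. This is an illustration under assumptions, not the actual pipeline: the real work happens in Turf against vector tiles, while this version uses a Python thread pool as a stand-in for a parallel runner. The record layout `(tile_id, kind, cell)` and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor

def group_by_tile(records):
    """Bucket (tile_id, kind, cell) records by tile; kind is 'probe' or 'street'."""
    tiles = defaultdict(lambda: {"probe": [], "street": []})
    for tile_id, kind, cell in records:
        tiles[tile_id][kind].append(cell)
    return tiles

def map_tile(item):
    """Map step: for one tile, count probe hits in cells that have no street."""
    tile_id, layers = item
    street_cells = set(layers["street"])
    missing = Counter(c for c in layers["probe"] if c not in street_cells)
    return tile_id, missing

def reduce_tiles(results):
    """Reduce step: keep only tiles that have at least one unmatched cell."""
    return {tile_id: missing for tile_id, missing in results if missing}

def run(records, workers=4):
    tiles = group_by_tile(records)
    # Each tile is independent, so the map step can be fanned out to workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return reduce_tiles(pool.map(map_tile, tiles.items()))

# Usage: tile "a" has two probe hits in an unmapped cell, tile "b" is fully mapped.
records = [
    ("a", "probe", (1, 1)), ("a", "probe", (1, 1)), ("a", "street", (2, 2)),
    ("b", "probe", (0, 0)), ("b", "street", (0, 0)),
]
summary = run(records)
```

Because each tile is processed without reference to its neighbors, the same shape of job scales out by adding workers rather than by changing the per-tile logic.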

In the map above, I have Mapbox Streets in green and probe traces in blue. Pink signifies locations that our algorithm has identified as weakly mapped. I can adjust the tolerance of the algorithm to only suggest locations that we have the most confidence in. This map shows each pixel where we have found traces but no corresponding street. Pixels are colored from white to red based on our confidence that the given pixel is missing a street.
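Here is a small sketch of how a per-pixel confidence, a white-to-red color ramp, and a tolerance cutoff could fit together. The scoring formula is an assumption made for illustration (a simple saturating function of the trace count), not the scoring our algorithm uses, and the names `confidence`, `color`, `suggest`, and the constant `k` are hypothetical.

```python
def confidence(trace_count, k=5.0):
    """Assumed score in [0, 1): more unmatched traces means higher confidence."""
    return trace_count / (trace_count + k)

def color(conf):
    """Linear ramp from white (conf=0) to red (conf=1) as an (r, g, b) tuple."""
    conf = min(max(conf, 0.0), 1.0)
    fade = round(255 * (1.0 - conf))
    return (255, fade, fade)

def suggest(trace_counts, tolerance=0.8, k=5.0):
    """Keep only pixels whose confidence meets the tolerance threshold."""
    scored = {cell: confidence(n, k) for cell, n in trace_counts.items()}
    return {cell: c for cell, c in scored.items() if c >= tolerance}

# Usage: a pixel crossed once is dropped, a pixel crossed 45 times is kept.
counts = {(0, 0): 1, (1, 1): 45}
strong = suggest(counts, tolerance=0.8)
```

Raising `tolerance` trims the suggestions down to the pixels with the most supporting traces, which is the trade-off described above.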

In this case, there appears to be a trail running along a major road that is not mapped in Mapbox Streets.

Let’s take a closer look at the trace data.

With an appropriate tolerance set, we suggest the missing roads and trails to mappers on the ground or to our data team by piping the information into to-fix, our micro-tasking tool.

The end result is the most up-to-date and accurate map possible.

We are just getting started with mass-scale vector analysis. Expect new open source libraries over the next couple of months, and an ever-improving OpenStreetMap.