We have been extending our satellite mosaicking production infrastructure while working on the next version of Cloudless Atlas. The new version of Cloudless Atlas is using Landsat imagery, giving us a half-trillion new cloud-free pixels down to zoom level 12. We have focused on open sourcing key parts of our imagery processing pipeline so that we can work together with other players in the satellite space. A big part of this new processing work is Rasterio, which I wrote for pushing and pulling raster data — all in Python.

Preview of the New Cloudless Atlas, South Africa

Imagining the Cloudless pipeline

When building the Cloudless Atlas pipeline, we knew we would be using the GDAL library and its utilities extensively. GDAL is fast and dependable and open source. Open source was going to be important for the Cloudless Atlas effort. For the kind of extremely flexible pipeline we needed (in terms of rapid development, retargeting, and so forth), licensed software wasn’t even an option.

It was also clear that while we would be using GDAL workhorses like gdal_translate and gdalwarp in the pipeline, the uniqueness of our final product and the need to scale in production would require some new software development. And finally, we were certain that we would being using Python quite a bit, if only to build prototypes and experiment with algorithms, and would be using Numpy in particular. (Python and Numpy are part of the satellite team’s special sauce.)

To get satellite imagery into Numpy arrays for use with Python, we were going to need our own bridge between GDAL and Python. GDAL has useful Python bindings, but we took the opportunity to design a more fun, more productive, and more forward-looking alternative.

First steps for Rasterio

Rasterio began as a small module of functions to read Numpy arrays from GeoTIFF files and write arrays back to other files. It had a rasterio.open() function modeled after Python’s built-in open(), with familiar file modes like ‘r’ and ‘w’. We gave the dataset object returned by rasterio.open() a matching close() method to make resource handling more deterministic when working with enormous datasets. The read_band() and write_band() methods of dataset objects use Numpy arrays for instances of band data, replacing Band objects entirely and eliminating one of GDAL’s biggest Python gotchas: band objects that reference deleted dataset objects.

Rasterio did not wrap GDAL’s Python bindings, but used Cython, a Python-like language and generator of Python C extension code, with the GDAL C API. Cython gave Rasterio strong performance from the start and let us write less code. Other projects like Pandas use Cython for the same reason.

Early on, Rasterio moved to a public Mapbox repository and became one of Mapbox’s many open source projects. This let it catch on quickly with other Python programming raster data analysts. Asger Petersen contributed the ability to read and write raster data subsets. Early prototypes of Cloudless Atlas pipeline components soon ran quickly and smoothly enough to convince the satellite team that Rasterio could replace the default GDAL Python bindings entirely.

Cloudless-driven features

As the new Cloudless Atlas pipeline was built out, it was not long before Rasterio needed to read and write raster band masks, color tables, and metadata, and to configure GDAL’s drivers within programs. A few exploratory side-projects resulted in fun features like feature extraction and array reprojection.

To make diagnosis of file problems easier, I wrote a program to let the satellite team lift the hood of any GDAL-supported file: rio_insp.

(rasters)MapBox-FC:rasterio sean$ rio_insp tiff-error.tif
Rasterio 0.8 Interactive Inspector (Python 2.7.5)
Type "src.meta", "src.read_band(1)", or "help(src)" for more information.
>>> src.meta
{'count': 7, 'crs': {u'a': 6378137, u'lon_0': 0, u'k': 1, u'y_0': 0, u'b': 6378137, u'proj': u'merc', u'x_0': 0, u'units': u'm', u'no_defs': True}, 'dtype': <type 'numpy.uint16'>, 'driver': u'GTiff', 'transform': [-7670726.02033, 30.0, 0.0, -2504574.35408, 0.0, -30.0], 'height': 2616, 'width': 2616, 'nodata': 0.0}
>>> src.read_band(5)
array([[42501, 47716, 51599, ...,     0,     0,     0],
       [41325, 46362, 53272, ...,     0,     0,     0],
       [40202, 49743, 53713, ...,     0,     0,     0],
       ...,
       [33249, 29334, 28306, ..., 43227, 42597, 41707],
       [28848, 30354, 29512, ..., 41077, 41248, 40642],
       [28109, 27293, 26299, ..., 44889, 41707, 38990]], dtype=uint16)
>>> band5 = _
>>> band5.min(), band5.max(), band5.mean()
(0, 65535, 26459.872859268766)
>>> show(band5, cmap='pink')

Every pixel in the final Cloudless Atlas product, and every one of the many more pixels that did not make the final cut, passed through Rasterio’s read_band() and write_band() methods. Rasterio has proven as fast and dependable as the GDAL it’s built on.

Investing more in Rasterio

Rasterio continues to grow from outside Mapbox as well as inside. Brendan Ward contributed the inverse of feature extraction: rasterization. His pull request validated for me the choice of Cython. And now not only did Rasterio have another developer writing a bunch of Cython code, but one who was doing the development and testing on Windows. Yes, Rasterio works on Windows!

Rasterio is designed to be worth Python programmers’ time. Its adherence to Python patterns and idioms makes it easy to learn and use. It is well tested, cross-platform, ready for Python 3, and proven on a large scale image processing project. Please try it out, watch its commits, file bugs, and contribute code!

At the onset of the Cloudless Atlas effort, the satellite team had a few broad objectives. We wanted an excellent product, we wanted a killer processing pipeline, and we wanted to share both the tools and tradecraft that we developed along the way. Enjoy making great maps with Mapbox’s Cloudless Atlas, and join us in making Rasterio the best possible raster data library for Python.

Feel free to get in touch with me about Rasterio or anything related – email me! or just ping me on twitter @sgillies