We have been extending our satellite mosaicking production infrastructure while working on the next version of Cloudless Atlas. The new version of Cloudless Atlas is using Landsat imagery, giving us a half-trillion new cloud-free pixels down to zoom level 12. We have focused on open sourcing key parts of our imagery processing pipeline so that we can work together with other players in the satellite space. A big part of this new processing work is Rasterio, which I wrote for pushing and pulling raster data — all in Python.
Preview of the New Cloudless Atlas, South Africa
Imagining the Cloudless pipeline
When building the Cloudless Atlas pipeline, we knew we would be using the
GDAL library and its utilities extensively. GDAL is fast and dependable and
open source. Open source was going to be important for the Cloudless
Atlas effort. For the kind of extremely flexible pipeline we needed (in terms
of rapid development, retargeting, and so forth), licensed software wasn’t
even an option.
It was also clear that while we would be using GDAL workhorses like
gdal_translate and gdalwarp in the pipeline, the uniqueness of our final
product and the need to scale in production would require some new software
development. And finally, we were certain that we would being using Python
quite a bit, if only to build prototypes and experiment with algorithms, and
would be using Numpy in particular. (Python and Numpy are part of the satellite
team’s special sauce.)
To get satellite imagery into Numpy arrays for use with Python, we were going
to need our own bridge between GDAL and Python. GDAL has useful Python
bindings, but we took the opportunity to design a more fun, more productive,
and more forward-looking alternative.
First steps for Rasterio
Rasterio began as a small module of
functions to read Numpy arrays from GeoTIFF files and write arrays back to
other files. It had a rasterio.open() function modeled after Python’s
built-in open(), with familiar file modes like ‘r’ and ‘w’. We gave the
dataset object returned by rasterio.open() a matching close()
method to make resource handling more deterministic when working with enormous
datasets. The read_band() and write_band() methods of dataset objects use
Numpy arrays for instances of band data, replacing Band objects entirely and eliminating one of GDAL’s biggest Python gotchas: band objects that reference
deleted dataset objects.
Rasterio did not wrap GDAL’s Python bindings, but used
Cython, a Python-like language and generator of Python
C extension code, with the GDAL C API. Cython gave Rasterio strong performance
from the start and let us write less code. Other projects like
Pandas use Cython for the same reason.
Early on, Rasterio moved to a public Mapbox repository and became one of
Mapbox’s many open source projects. This let it catch on quickly with other
Python programming raster data analysts. Asger
Petersen contributed the ability
to read and write raster data subsets. Early prototypes of Cloudless Atlas
pipeline components soon ran quickly and smoothly enough to convince the
satellite team that Rasterio could replace the default GDAL Python bindings
Every pixel in the final Cloudless Atlas product, and every one of the many
more pixels that did not make the final cut, passed through Rasterio’s
read_band() and write_band() methods. Rasterio has proven as fast and
dependable as the GDAL it’s built on.
Investing more in Rasterio
Rasterio continues to grow from outside Mapbox as well as inside. Brendan
Ward contributed the inverse of
His pull request validated for me the choice of Cython. And now not only did
Rasterio have another developer writing a bunch of Cython code, but one who was
doing the development and testing on Windows. Yes, Rasterio works on Windows!
Rasterio is designed to be worth Python programmers’ time. Its adherence to
Python patterns and idioms makes it easy to learn and use. It is well tested,
cross-platform, ready for Python 3, and proven on a large scale image
processing project. Please try it out,
watch its commits, file bugs, and contribute code!
At the onset of the Cloudless Atlas effort, the satellite team had a few
broad objectives. We wanted an excellent product, we wanted a killer processing
pipeline, and we wanted to share both the tools and tradecraft that we
developed along the way. Enjoy making great maps with Mapbox’s Cloudless
Atlas, and join us in making Rasterio the best possible raster data library
Feel free to get in touch with me about Rasterio or anything related – email
me! or just ping me on twitter @sgillies