Bulk open data access is different from programmatic and casual access, but all three are key elements of a sophisticated spatial data infrastructure. The extent of this difference becomes painfully clear when you start trying to access dozens of terabytes of data, which is what I have been doing for the past four months as the research and acquisition lead behind the launch of MapBox Satellite, created entirely from public domain open data. We need to be talking more about bulk data access within the open data space. Concretely, open data projects should include bulk data access and retrieval options that are as free and open as the open data itself.
The launch of MapBox Satellite was a huge open data success story for us, and it was only possible because of federal, state, and local open data efforts within the United States. Fully embracing open government, the United States releases high resolution aerial imagery under a free and open license through programs such as the National Agriculture Imagery Program and High Resolution Orthoimagery. Acquiring high-resolution aerial photography for the continental United States involved every state’s GIS agency, the USDA Farm Service Agency, and the U.S. Geological Survey.
Open data is not truly “open” if it is inaccessible. As the open data space matures, domestically and internationally, we need to start talking about best practices for making open data more open and accessible. I’m going to go over a few of the key takeaways from my recent work, highlighting some successes and failures in providing geodata at the state and local levels in the United States. First off, let’s define our key terms:
The Open Knowledge Definition provides a simple, yet comprehensive, definition we can apply to open data:
A piece of content or data is open if anyone is free to use, reuse, and redistribute it — subject only, at most, to the requirement to attribute and/or share-alike.
The degree of openness can be measured along multiple dimensions, including access, redistribution, reuse, attribution, and non-discrimination against persons, groups, or fields of endeavor, among others. I am going to focus on the first aspect, access, operating under three assumptions:
- There are different types of open data users.
- Different users have different needs and abilities.
- Data accessibility matters.
There are three main types of open government data users I am considering.
Casual data users
Casual users are the least technical users, who discover new datasets from a government data portal or geoportal. Their open data needs are self-contained and usually satisfied by existing data infrastructure, provided basic needs, like the ability to query and download, are met.
Programmatic data users
Programmatic users have some technology background and are no strangers to APIs. Such users might need to download a few time-sensitive items on a recurring basis, or to download certain datasets based on the results of an external query. Their open data needs could be satisfied by a well-maintained data portal and a data access/download API, similar to the USGS’ Seamless Data Warehouse Application Services.
Bulk data users
Bulk data users are technologically inclined, like programmatic users, but need large datasets, sometimes entire datasets, of varying sizes, and all at once.
Geoportals, data discovery websites with a strong emphasis on geospatial data, hinder rather than help the bulk data user, since the goal is everything, not an intersection. Data APIs simply cannot scale to meet their needs. Try downloading a JSON version of the NYC Building Perimeter Outlines from NYC’s open data site using the Socrata Open Data API, or attempting to order statewide orthoimagery coverage using The National Map. Bulk data users often have to resort to sending physical storage devices to the data provider, which is the policy for orthoimagery from Connecticut’s Department of Energy and Environmental Protection.
Geoportals and (in)accessible open data
Many federal, state, and local government agencies have invested in developing geoportals as endpoints for accessing their data, enabling users to run spatial queries on, visualize, and sometimes download the agency’s data holdings.
Many geoportals are still functionally in beta stages, meaning they are sometimes buggy and unpredictable, but are actively being developed. A good example is FEMA’s Flood Mapping Inventory, which provides a download endpoint for first responders to access FEMA’s fused orthoimagery. FEMA’s portal was formerly connected to the USGS seamless server, which was replaced by the National Map.
FEMA’s Imagery Download Toolkit and many other geoportals, like Massachusetts GIS, share a strong emphasis on data acquisition through user interaction with the portal’s interface. Only after successfully completing certain steps in a particular order - selecting, for example, the county, neighborhood, year, dataset, metadata type, and output file format - is the user presented with the option to download the dataset.
The National Map, maintained by the USGS National Geospatial Program, provides free, public domain national-level datasets from a central location. The National Map can potentially save a lot of time for many types of open data users - providing in one location what visitors would otherwise have to obtain from each state. In practice, the National Map follows the same pattern as many other geoportals, requiring users to complete many steps to get from data discovery to selection, to ordering, and then finally to downloading.
Visitors select an area of the United States, choose among the layers intersecting with that area, and order subsections of the national dataset covering that area. After a short period of time, the National Map sends an email with about 40 individual download links, depending on the file size of the dataset ordered. The combination of these files creates the mosaic for the area of interest. Users then download each individual link - a task which would be all but impossible without the help of cURL or Wget. To obtain continuous coverage, say, for an entire state, users must repeat the process, with no clear way to eliminate overlapping areas.
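The link-by-link download step described above lends itself to scripting. What follows is a minimal sketch in Python, assuming the order email has been saved as plain text. The function names (`extract_links`, `local_name`, `download_all`) and the output directory are hypothetical and are not part of any USGS tooling.

```python
import os
import re
import urllib.request
from urllib.parse import urlparse

# Matches http(s) and ftp URLs up to the next whitespace or bracket.
URL_PATTERN = re.compile(r"(?:https?|ftp)://[^\s<>()]+")


def extract_links(email_text):
    """Return the unique URLs found in email_text, in order of first appearance."""
    seen = set()
    links = []
    for url in URL_PATTERN.findall(email_text):
        url = url.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


def local_name(url, index):
    """Derive a local file name from the URL path, falling back to a numbered name."""
    name = os.path.basename(urlparse(url).path)
    return name or f"download_{index:03d}.bin"


def download_all(links, out_dir="downloads"):
    """Download each link into out_dir, skipping files that are already present."""
    os.makedirs(out_dir, exist_ok=True)
    for index, url in enumerate(links, start=1):
        target = os.path.join(out_dir, local_name(url, index))
        if os.path.exists(target):
            continue  # present from an earlier run; not checked for completeness
        urllib.request.urlretrieve(url, target)
```

Calling `download_all(extract_links(open("order_email.txt").read()))` would fetch every link from a saved email, where `order_email.txt` is a placeholder file name. This automates the clicking, but it does nothing about the overlapping areas between separate orders.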
[Image: The National Map]
The National Map is a great tool for data discovery, access, and acquisition, but it is not well suited for bulk data retrieval. In the USGS’ defense, they do have a well-staffed bulk ordering office at the Earth Resources Observation and Science (EROS) Center, which offers bulk users the option of mailing in hard drives.
A harsh reality of the acquisition process was that government websites are not always reliable or consistent. Links break, sites go down, and data disappears. One of the most valuable tools I used to combat this reality was the Internet Archive’s Wayback Machine, which provides open access to the more than 150 billion web pages archived since 1996. On any given day, at least one of the government websites I visited was down for maintenance or had broken links leading to pages that no longer existed. The Wayback Machine was instrumental at these times, enabling us to find the original content and, in some instances, even download the original dataset.
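For readers who have not used the Wayback Machine this way, archived copies are addressed by prefixing the original URL with a timestamp. Here is a minimal sketch in Python of building such an address. The function name `wayback_url` and the example agency URL are hypothetical, and the timestamp only requests the capture nearest that date, so the snapshot actually served may differ or may not exist at all.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, timestamp="2012"):
    """Build a Wayback Machine address for original_url.

    timestamp is a prefix of YYYYMMDDhhmmss; the archive is expected to
    redirect to the capture closest to it, if one exists.
    """
    if not timestamp.isdigit() or not 4 <= len(timestamp) <= 14:
        raise ValueError("timestamp must be 4 to 14 digits (YYYY[MMDDhhmmss])")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical broken link on a state GIS site:
print(wayback_url("http://gis.example.gov/data/ortho.zip", "20120601"))
```

Whether the archived page still links to a downloadable copy of the dataset varies from capture to capture, so this is a fallback rather than a dependable retrieval method.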
Bulk data accessibility
There are many state and local agency data portals that are accessible to the bulk data user, as well as to programmatic and casual users. It’s worthwhile to look at a few success stories.
![Utah Mapping Portal](https://farm9.staticflickr.com/8479/8261351769_61372b52bc_o.jpg)
New Hampshire’s Statewide GIS Clearinghouse has a single page listing all of its free datasets, the datasets’ access methods, and availability.
Brazil’s Instituto Brasileiro de Geografia e Estatística has a single page listing all of IBGE’s available datasets, with direct links to the FTP directory for data available to download.
For orthoimagery, New Hampshire offers users the option of sending in a hard drive, or retrieving the source files via FTP.
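Where an agency exposes its source files over FTP, bulk retrieval can be a short script. The following is a minimal sketch using Python's standard `ftplib`, assuming anonymous login and a flat directory of files. The function names, the default file extensions, and the host and directory in the usage note are placeholders, not real endpoints for any agency mentioned here.

```python
import ftplib
import os


def select_files(names, extensions=(".zip", ".tif")):
    """Keep only names ending in one of the wanted extensions (case-insensitive)."""
    wanted = tuple(ext.lower() for ext in extensions)
    return sorted(n for n in names if n.lower().endswith(wanted))


def mirror_directory(host, remote_dir, out_dir, extensions=(".zip", ".tif")):
    """Download every matching file in remote_dir on host into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    with ftplib.FTP(host) as ftp:
        ftp.login()  # anonymous login; an assumption about the server
        ftp.cwd(remote_dir)
        for name in select_files(ftp.nlst(), extensions):
            target = os.path.join(out_dir, name)
            if os.path.exists(target):
                continue  # present from an earlier run; not checked for completeness
            with open(target, "wb") as handle:
                ftp.retrbinary(f"RETR {name}", handle.write)
```

A call such as `mirror_directory("ftp.example.gov", "/pub/ortho", "ortho")` would then pull a whole directory in one pass, with no portal interaction. This directness is what makes a plain FTP listing so useful to bulk users.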
Challenges and opportunities
This overview of open data efforts examines existing mechanisms for data access and acquisition with an appreciation of the possibility of new types of open data uses and users. Heavily interface-driven sites based on predetermined data acquisition processes are a key part of local, state, and national spatial and open data infrastructures, but they do not make open data open and accessible to all.