We have one simple mission on the Geocoding team: know about every place in the world. To accomplish our goal, we incorporate place data from dozens of sources and leverage Wikidata to handle requests in hundreds of different languages. As a pleasant side effect of teaching our geocoder about the world, we also get to dig into the little-known locations and quirks of language that make places unique.
Inspired by a tweet, we got together to round up some highlights from the last few months:
While working on the strasse quest, I was looking up different streets in Germany to collect test data and came across Heerstrasse, Bonn, Germany. The next thing you know, I had spent an hour looking at pictures of this street and making a mental note to visit Germany in April (sometime in the future)!
Lately, I’ve loaded translations in, among others, the Azerbaijani language (sometimes called Azeri), a Turkic language spoken as the official language in Azerbaijan, and by Azerbaijani communities in Russia, Iran, Georgia, and Turkey. As with other languages in the region, repeated conquest and annexation have exerted significant influence on it, and one consequence is that it has been written in multiple scripts over time. While not currently truly digraphic like Serbian is, Azerbaijani is fascinating in just how many times it has changed:
it was first written in the Persian version of the Arabic script
after Azerbaijan was incorporated into the USSR, Soviet officials first created a Turkish-inspired Latin alphabet in 1929, hoping to create cultural division between Azerbaijanis and Iran
under Stalin, most non-Cyrillic users were forced to switch to Cyrillic, and that included Azerbaijani, which began to be written using the Cyrillic alphabet in 1939 (and saw the introduction of a different Cyrillic scheme in 1958)
after the fall of the USSR in 1991, Azerbaijani switched back to a Latin alphabet, though a different one from the alphabet used before the switch to Cyrillic, and that alphabet was further revised in 1992
When loading geospatial data on China, it’s often difficult for a westerner to research the administrative areas because it’s not always clear how their names should be transliterated into a Latin script. So, when a new piece of data doesn’t match what we’d had in the past, it isn’t clear why. Is the new data incorrect? Has the name of the area changed? More commonly, it turns out that our old and new data really do match, but are transliterated slightly differently.
One good example of a discrepancy that had me baffled was an area in the province of “Xīzàng Zizhiqu,” aka the “Western Tsang Autonomous Region.” Western readers probably recognize it by a different name: “Tibet.” Around this time of year, the area is celebrating the Nyingchi Peach Blossom Festival:
In some newly imported data, the county in the above photo is called “Nyingchi Xian,” but we previously had it labeled “Línzhīxian.” In both cases, “Xian” means county, but it is sometimes written as a separate word and other times written as a suffix of the name. That’s a stylistic choice, since Chinese writing doesn’t break at word boundaries. But why Nyingchi vs Linzhi? Some Wikipedia research reveals that Nyingchi (ཉིང་ཁྲི་ས།) is a Tibetan name and that Línzhī (林芝) is a Chinese name for the same area.
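To make the two kinds of mismatch concrete, here is a minimal sketch in Python. It is not our import pipeline: the function names are made up for illustration, and it assumes the only stylistic differences are tone diacritics, letter case, and whether “xian” is attached to the name. Under those assumptions, the two spellings of the Chinese name collapse to the same key, while the Tibetan name still does not match and would need an alias from some other source.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks, e.g. the pinyin tone marks in 'Línzhī'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_county(name, suffix="xian"):
    """Return (base name, suffix) with case, spacing, and diacritics folded.

    Hypothetical helper: assumes 'xian' (county) may be a separate word
    or glued onto the end of the name.
    """
    folded = strip_diacritics(name).lower().replace(" ", "")
    if folded.endswith(suffix) and len(folded) > len(suffix):
        return folded[: -len(suffix)], suffix
    return folded, ""


old_label = normalize_county("Línzhīxian")
same_name_other_style = normalize_county("Linzhi Xian")
new_label = normalize_county("Nyingchi Xian")

print(old_label == same_name_other_style)  # the styling difference disappears
print(old_label == new_label)              # Tibetan vs Chinese name: still different
```

Normalization like this only removes differences in styling. Telling us that Nyingchi and Línzhī are the same place takes outside knowledge, which is where the Wikipedia research came in.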
This particular case was also a good example of how China’s administrative structures have been changing rapidly in recent years. Although the data I was importing was only a couple of years old, it seemed at odds with current reality: there apparently is no Nyingchi County or Línzhī County…
This is where wiki edit histories can be really instructive (credit to Minh Nguyễn from our mobile team for teasing this out). Originally, Wikipedia had articles about “Bayi” (town) in “Nyingchi County” in “Nyingchi Prefecture.” Then, in 2015, these articles were renamed to Bayi Subdistrict, Bayi District, and Nyingchi, respectively. The Chinese Wikipedia renamed its articles at about the same time, citing this Chinese news article (Google Translate). Apparently, Nyingchi was upgraded from a county to a prefecture-level city, so its seat of government was upgraded from a town inside a county to a subdistrict inside a district.
We separate our data into different types, like pois (“points of interest”, like restaurants and tourist destinations), places (cities), and regions (like “states” in the United States, or “departments” in France). When we’re loading new data into our geocoder, part of the challenge is figuring out what type to give each feature. While importing new data on Malaysian cities, I came across a polygon named Batu Caves. Batu Caves sounds more like a POI than a city, so I did a little digging and 😻:
It’s the name of a town that hosts an epic temple complex built into limestone caverns. The caves feature a number of endemic animals, including tube-dwelling spiders and several species of bats. In this case it’s also a great fit for our place layer: it’s a rapidly growing town just outside of Kuala Lumpur, and has multiple schools, parks, and neighborhoods.
An oxbow bend in the Chao Phraya river forms Bang Kachao, a 20 sq km park directly across from central Bangkok. Developers have been clamoring for access, but for the moment it remains an interesting contrast to the enormous city that surrounds it.
I’ve been working on removing place duplicates from our API. We sometimes return duplicate place names where the place and the region share the same name. Sometimes these are legit (“New York, New York, USA”) and sometimes they are not (“London, London, England”).
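One way to picture this kind of cleanup is the sketch below. It assumes a feature’s label is built from a list of names running from most specific to least specific, and that legitimate repeats are known ahead of time through an allowlist. Both the `build_label` function and the `LEGIT_REPEATS` set are hypothetical and only illustrate the idea; deciding which repeats are legitimate is the hard part and is not solved here.

```python
# Hypothetical allowlist of (name, parent name) repeats that should be kept.
LEGIT_REPEATS = {("New York", "New York")}


def build_label(names):
    """Join hierarchy names into a label, collapsing unwanted repeats.

    `names` runs from most specific (the place) to least specific
    (the country). A name that repeats the previous one is dropped
    unless the pair is in the allowlist.
    """
    label = []
    for name in names:
        is_repeat = bool(label) and label[-1] == name
        if is_repeat and (label[-1], name) not in LEGIT_REPEATS:
            continue
        label.append(name)
    return ", ".join(label)


print(build_label(["New York", "New York", "USA"]))   # New York, New York, USA
print(build_label(["London", "London", "England"]))   # London, England
```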
While working with a customer enquiring about Liechtenstein address support, I discovered a new class of duplicates where a locality, place, region, and country all share the same name.
Previously a search for “Liechtenstein” would return
At one point we were returning the feature on the left (Alberta) when someone searched for the feature on the right (Aruba). Why? Well, we had added Japanese translations for our features, including アルバータ州 for Alberta. And we used an ASCII normalization library behind the scenes – mostly to simplify some operations and smooth out diacritics in Latin characters, but also for its not entirely nonexistent transliteration abilities. Alas, that library transforms アルバータ州 into Arubatazhou. Aruba looks like a matching query for that (when considered as a part of an autocomplete sequence).
We soon implemented a fix to isolate CJK character-using names from other ones, and since then have switched to an architecture that fully supports Unicode. But this is a good example of a larger class of headaches in geocoding: adding data on one side of the world can screw up queries on the other in ways that don’t happen with other types of geo work.
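The Alberta/Aruba mix-up can be reproduced in a few lines. The Python sketch below is an illustration, not our actual code: `TRANSLIT` is a one-entry stand-in for the ASCII normalization library (seeded with the output quoted above), and `has_cjk`, `naive_match`, and `isolated_match` are hypothetical names. The first matcher folds everything to ASCII before prefix matching, which is how the bug arises; the second keeps CJK names away from non-CJK queries, roughly in the spirit of the interim fix.

```python
import unicodedata

# Stand-in for the ASCII normalization library (assumption: one known entry).
TRANSLIT = {"アルバータ州": "Arubatazhou"}

CJK_PREFIXES = ("CJK", "HIRAGANA", "KATAKANA", "HANGUL")


def to_ascii(name):
    """Fold a name to ASCII: table lookup first, else strip non-ASCII."""
    if name in TRANSLIT:
        return TRANSLIT[name]
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if ord(ch) < 128)


def has_cjk(text):
    """True if any character's Unicode name marks it as CJK, kana, or hangul."""
    return any(unicodedata.name(ch, "").startswith(CJK_PREFIXES) for ch in text)


def naive_match(query, names):
    """Autocomplete-style prefix match after folding everything to ASCII."""
    q = to_ascii(query).lower()
    return any(to_ascii(n).lower().startswith(q) for n in names)


def isolated_match(query, names):
    """Prefix match that never compares CJK names with non-CJK queries."""
    query_is_cjk = has_cjk(query)
    for n in names:
        if has_cjk(n) != query_is_cjk:
            continue  # different script family: skip instead of transliterating
        if query_is_cjk:
            if n.startswith(query):
                return True
        elif to_ascii(n).lower().startswith(to_ascii(query).lower()):
            return True
    return False


alberta_names = ["Alberta", "アルバータ州"]
print(naive_match("Aruba", alberta_names))     # True: the bug
print(isolated_match("Aruba", alberta_names))  # False: CJK name is ignored
print(isolated_match("アルバ", alberta_names))  # True: Japanese query still works
```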
~ fin ~
We don’t have to stop with the Geocoding team; we’d love to hear about the special places you’ve come across recently, too! Hit us up on Twitter using #geobucketlist to share your geographic bucket list. Want to explore the world as part of your day job? We’re hiring!