Mapbox GL JS now supports Arabic and Hebrew text with an optional plugin, and the next releases of our iOS and Android SDKs will include support out of the box. This also extends to languages like Persian (Farsi) and Urdu that use the Arabic script. In this post, I’d like to share why these scripts presented a technical challenge for us, and how we developed a solution.
OpenGL, Unicode Codepoints, and Glyphs
Incorporating support for Arabic and Hebrew text was a non-trivial project because our maps render everything with direct instructions to 3D graphics hardware using OpenGL, which doesn’t provide any support at all for rendering text. Instead of relying on the operating system or the browser to do the hard work for us, we’ve had to implement our own text rendering code.
Let’s go through how we’d render the label “Hey” on a map.
The input we get is encoded as a Unicode string that looks like:
U+0048 U+0065 U+0079
Each of the numeric codes above is called a “codepoint,” and that’s the basic input we work with. A codepoint represents the concept of a character, while we call the actual visual representation of a character a “glyph”. To render “Hey,” we need to convert codepoints to glyphs, and then we need to tell the graphics card exactly where to place each of those glyphs (so the letters line up next to each other). Here’s roughly how we do it:
1. We ask the Mapbox font server to give us the glyphs for the codepoints “H,” “e,” and “y.”
2. The server looks up the glyphs for the codepoints in the appropriate font.
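The codepoint-to-position part of this process can be sketched in a few lines of Python. This is only an illustration, not our rendering code: the `ADVANCES` widths are made-up numbers standing in for the metrics a real font would report, and the function names are hypothetical.

```python
# Hypothetical advance widths (in pixels); a real font supplies these per glyph.
ADVANCES = {"H": 12, "e": 9, "y": 9}

def codepoints(label):
    """Return the Unicode codepoints of a label in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in label]

def layout_left_to_right(label, advances):
    """Place each glyph at the running x offset, starting at the left edge."""
    x = 0
    placed = []
    for ch in label:
        placed.append((ch, x))
        x += advances[ch]  # move the pen right by this glyph's width
    return placed

print(codepoints("Hey"))
print(layout_left_to_right("Hey", ADVANCES))
```

The important detail is that the loop walks the string from first character to last while the x offset only ever grows, which is exactly the assumption that breaks for right-to-left scripts.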
Arabic and Hebrew are both written right-to-left (RTL) instead of left-to-right (LTR). Under the hood, label strings are stored in “logical order,” which means the characters in the string are stored in the same order they would be read (you could also call it “first-to-last” order). Since our layout algorithm starts at the left and moves right, we’re implicitly converting “first-to-last” to “left-to-right,” and the text ends up backwards.
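To make the problem concrete, here is a small Python sketch (not our actual layout code) that uses the standard library's `unicodedata` module to detect right-to-left characters, and then mimics a layout loop that always starts at the left. The function names and the one-slot-per-character positions are assumptions made for illustration.

```python
import unicodedata

# Shin, lamed, vav, final mem: the Hebrew word "shalom" in logical order.
HEBREW_LABEL = "\u05E9\u05DC\u05D5\u05DD"

def is_rtl(ch):
    """True for strong right-to-left characters.

    "R" is the Unicode bidi class of Hebrew letters, "AL" that of Arabic letters.
    """
    return unicodedata.bidirectional(ch) in ("R", "AL")

def place_left_to_right(label):
    """Mimic a layout loop that starts at slot 0 and only moves right."""
    return [(slot, ch) for slot, ch in enumerate(label)]

placed = place_left_to_right(HEBREW_LABEL)
# The first letter a reader expects (shin) lands in the leftmost slot,
# but in right-to-left text it belongs in the rightmost one.
print(placed[0], all(is_rtl(ch) for ch in HEBREW_LABEL))
```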
We can’t just reverse the order of the characters because we have to handle the case where LTR text is mixed with RTL text. Mixed LTR/RTL text is called “bidirectional text”. Bidirectional text can show up when you have a multilingual label, but it also shows up in the common case of an Arabic label that includes numbers, because Arabic numerals are actually written LTR (!). Once you start doing layout across multiple lines, “reversing just the RTL text” becomes surprisingly difficult to do. If the top line is RTL text, should you start laying out the next line from the right, even if the characters on the next line are mostly LTR text? What if the LTR characters are within parentheses and are followed by more RTL text?
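The number case is easy to reproduce. The Python sketch below first shows that blindly reversing an Arabic label also reverses its digits, and then patches that one case by restoring each run of ASCII digits. It is a deliberately simplified illustration, not the Unicode Bidirectional Algorithm: it assumes a single line, a right-to-left base direction, and no Latin words or parentheses. The example label and function names are ours.

```python
import re

# "Street" in Arabic (shin, alef, reh, ain) followed by a house number.
LABEL = "\u0634\u0627\u0631\u0639 123"

def naive_reverse(logical):
    """Reverse everything: fine for pure RTL text, wrong for numbers."""
    return logical[::-1]

def reverse_keeping_digits(logical):
    """Reverse the line, then flip each run of ASCII digits back to LTR."""
    reversed_all = logical[::-1]
    return re.sub(r"[0-9]+", lambda m: m.group(0)[::-1], reversed_all)

print(naive_reverse(LABEL))           # the number reads 321
print(reverse_keeping_digits(LABEL))  # the number reads 123 again
```

Even this small patch hints at why the general problem is hard: every extra case (Latin words, punctuation, line breaks) needs its own rule, which is what the full algorithm provides.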
The characters are now in the correct right-to-left order (even though we printed them starting from the left). If we had rendered the Hebrew cognate “שָׁלוֹם” (Shalom), we’d be done by now, but in Arabic there’s still more work to do to make the characters legible.
In printed Arabic, each character can have an “isolated,” “initial,” “medial,” and “final” form. As an example, here are the four forms for the Arabic letter “meem” (U+0645).
The form you choose depends on the surrounding characters. If you select the right forms and place them next to each other, the words will appear gracefully connected, as if written in cursive.
Arabic fonts store all four of these glyphs for the single codepoint for “meem” (U+0645). Choosing the right glyph to display for the codepoint “meem” based on the surrounding codepoints is the core of the problem of complex text layout.
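The selection rule itself can be sketched in Python. The joining table below is an assumption: it is hand-written and covers only four letters, where a real implementation would rely on the joining data for the whole script. The function names are hypothetical too.

```python
# Assumed, deliberately tiny joining table covering only the letters used below.
DUAL_JOINING = {"\u0633", "\u0644", "\u0645"}  # seen, lam, meem: connect on both sides
RIGHT_JOINING = {"\u0627"}                     # alef: connects only to the letter before it

def connects_forward(ch):
    """Can this letter connect to the letter that follows it?"""
    return ch in DUAL_JOINING

def connects_backward(ch):
    """Can this letter connect to the letter that precedes it?"""
    return ch in DUAL_JOINING or ch in RIGHT_JOINING

def choose_forms(word):
    """Return the form name for each letter of a word given in logical order."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else ""
        nxt = word[i + 1] if i + 1 < len(word) else ""
        joined_before = connects_forward(prev) and connects_backward(ch)
        joined_after = connects_forward(ch) and connects_backward(nxt)
        if joined_before and joined_after:
            forms.append("medial")
        elif joined_before:
            forms.append("final")
        elif joined_after:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

# Seen, lam, alef, meem: the meem is isolated because alef never connects forward.
print(choose_forms("\u0633\u0644\u0627\u0645"))
```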
Normally, we wouldn’t be able to do complex text layout without using a library like HarfBuzz and having access to the “shaping tables” for the font, but in Arabic a fortuitous historical accident gives us an easy way out. When the Unicode encoding was standardized, one of its design goals was to provide an equivalent Unicode codepoint for every codepoint that existed in one of the then-current national encodings. Early Arabic encodings avoided the complex text shaping problem by assigning a codepoint for every single glyph (at the cost of making word processing a lot more complicated since editing one character also required editing surrounding characters). To support these original Arabic encodings, Unicode introduced what it calls the “presentation forms” of Arabic letters, where each codepoint represents exactly one form/glyph.
These “presentation form” codepoints aren’t normally used in writing Arabic, but if we know the rules of Arabic, we can take any “normal” string of Arabic text and replace all of the codepoints with the appropriate “presentation form” codepoint. By doing so, we remove all ambiguity about which glyph goes with which codepoint. Again, we are lucky that ICU will do this transformation for us automatically.
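The link between a “normal” letter and its presentation forms is recorded in the Unicode character database, so it can be explored without ICU. The Python sketch below builds a lookup table from the compatibility decompositions of the Arabic Presentation Forms-B block. It only illustrates the mapping and is not how ICU implements shaping; `build_presentation_table` and `to_presentation` are hypothetical names.

```python
import unicodedata

FORM_TAGS = {"<isolated>", "<final>", "<initial>", "<medial>"}

def build_presentation_table():
    """Map (base letters, form name) to a presentation-form codepoint."""
    table = {}
    for cp in range(0xFE70, 0xFEFD):  # assigned part of Arabic Presentation Forms-B
        parts = unicodedata.decomposition(chr(cp)).split()
        if parts and parts[0] in FORM_TAGS:
            base = "".join(chr(int(h, 16)) for h in parts[1:])
            table[(base, parts[0].strip("<>"))] = chr(cp)
    return table

PRESENTATION = build_presentation_table()

def to_presentation(letter, form):
    """Return the presentation form, or the letter unchanged if none exists."""
    return PRESENTATION.get((letter, form), letter)

# The four forms of meem (U+0645).
for form in ("isolated", "final", "initial", "medial"):
    print(form, f"U+{ord(to_presentation(chr(0x0645), form)):04X}")
```

Because each returned codepoint names exactly one glyph, a renderer that only knows how to map codepoints to glyphs one-to-one can draw the result without any shaping logic of its own.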
Here we combine ICU’s Arabic shaping with the bidirectional transformation:
With no line breaks, the output in left-to-right visual order is:
U+FEE1 ("Meem Isolated")
U+FEFC ("Lam with Alef Combined")
U+FEB3 ("Seen Initial")
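The three codepoints above can be reproduced by hand. The Python sketch below assumes the input was the four letters seen, lam, alef, meem (an inference from the output, not something ICU reports) and starts from their shaped forms in logical order. It then merges lam and alef into the combined glyph and reverses the line. The whole-string reversal is only valid for a single line that is entirely right-to-left, and ICU's real implementation is far more general.

```python
import unicodedata

# Assumed shaped input in logical order:
# seen (initial), lam (medial), alef (final), meem (isolated).
SHAPED = "\uFEB3\uFEE0\uFE8E\uFEE1"

# Lam (medial) followed by alef (final) is drawn as one combined glyph.
LIGATURES = {"\uFEE0\uFE8E": "\uFEFC"}

def merge_ligatures(text):
    """Replace each known letter pair with its single ligature codepoint."""
    for pair, ligature in LIGATURES.items():
        text = text.replace(pair, ligature)
    return text

def to_visual_order(text):
    """Reverse a line that is known to be entirely right-to-left."""
    return text[::-1]

visual = to_visual_order(merge_ligatures(SHAPED))
names = [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in visual]
for name in names:
    print(name)
```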
Using ICU in Mapbox GL JS
The GL JS version of this functionality just shipped in v0.32.1, and the functionality will ship in the next versions of our mobile SDKs. If you’re interested in details, contact us or follow these developments on GitHub. After launch we’ll keep working on expanding the typographical capabilities of the map. Some of the features at the top of our list are: