Today we’re releasing the content testing tool we use at Mapbox to automatically review all of our technical docs.

Think of it as a pedantic robot that makes sure we write simple English, use consistent terminology, and avoid insensitive language. The project is called retext-mapbox-standard, and you can download and use it right now. We also want it to serve as an example, so that everyone can write helpers to solve their own copyediting issues.

We started unit-testing our blog back in 2013 with code that checked metadata like author names and categories for correctness. That test suite saves us from countless mistakes, but it’s mostly concerned with the easily parseable YAML metadata around our posts, not with their content, which is typically Markdown-formatted English prose.

With retext-mapbox-standard, we can go much deeper.

Standardizing language

retext-mapbox-standard isn’t intended to force everyone to use serial commas or ban prepositions at the end of a sentence. In its current state, it doesn’t include any punctuation or grammar rules. Instead, we focus on usage and target technical writing.

Technical writing is a particular style of English that’s usually defined by function and correctness. Our new API documentation was the motivating example for retext-mapbox-standard. API documentation at its best is friendly, simple, and consistent. With these goals in mind, we added four sets of rules to retext-mapbox-standard:

  1. retext-simplify: recommends simple, commonplace words in place of ten-cent words.
  2. retext-equality: bans gendered language and potential slurs.
  3. A list of acronyms and brands that should be consistently styled.
  4. A list of words to avoid in educational writing.

These rules are collected from great sources such as the National Center on Disability and Journalism, plainlanguage.gov, and Chris Coyier’s list of words to avoid in educational writing.

Here’s a typical rule, from retext-simplify, via plainlanguage.gov:

{
  "shall": {
    "replace": [
      "must", "will"
    ]
  }
}

The word ‘shall’ usually means the same as ‘must’ or ‘will’ but is much less common, so we don’t use it in our documentation. Some of the rules that we include are strict: we keep retext-equality’s rule against using the words ‘him’ or ‘her’, and we have no trouble enforcing it because technical documentation rarely deals with specific people.
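To make the shape of such a rule concrete, here is a minimal Python sketch of how a checker might apply it. It is not the retext-simplify implementation: the `RULES` table and the `check_simplify` function are hypothetical names, and the rule table only mirrors the JSON shown above.

```python
import re

# Hypothetical rule table in the same shape as the JSON rule above.
RULES = {
    "shall": {"replace": ["must", "will"]},
}

def check_simplify(text, rules=RULES):
    """Return one message per word in `text` that has a simpler replacement."""
    messages = []
    # Walk the text word by word so that rules only match whole words.
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group(0)
        rule = rules.get(word.lower())
        if rule:
            options = ", ".join(f'"{option}"' for option in rule["replace"])
            messages.append(f'Replace "{word}" with {options}')
    return messages

messages = check_simplify("The client shall send a token.")
print(messages)
```

A sentence that already uses ‘must’ or ‘will’ produces no messages, so the check can run over every document in a test suite.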

Tuning & context

Creating a strict set of rules is tempting, but testing your content will immediately prove that context is everything. Between two people, “requesting” to talk is roughly the same as “asking” to talk, and “asking” is the simpler and easier word. But in a technical document, “request” has a well-known meaning, and writing that you should “ask” the server for resources is at best cheeky and at worst confusing and incorrect.

Or look at our strict style guide. We always refer to TIFF files as TIFF files, even though they can also be called TIF files, and the rule that enforces this fails the tests if you use the other form. But on our page about TIFF files, we want to write:

A TIFF, or sometimes TIF, is a file format for saving raster images.

How do we make resolute programs and context-rich language agree?

First, we tune the rules globally. retext-equality’s rule against the word ‘disabled’ would be appropriate in another context - we would replace the word with ‘person with a disability’ - but the technical term disabled is embedded in the basic standard of HTML, so we need it to write documentation.
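Global tuning of this kind amounts to filtering the upstream rule set before any document is checked. A minimal Python sketch, assuming a rule table keyed by the flagged word; `UPSTREAM_RULES`, `GLOBAL_ALLOW`, and `tune` are hypothetical names, not part of retext-equality or retext-mapbox-standard:

```python
# Hypothetical upstream rule set, keyed by the flagged word.
UPSTREAM_RULES = {
    "disabled": {"replace": ["person with a disability"]},
    "shall": {"replace": ["must", "will"]},
}

# Words allowed everywhere because they are established technical terms.
GLOBAL_ALLOW = {"disabled"}

def tune(rules, allow):
    """Return a copy of `rules` without the globally allowed words."""
    return {word: rule for word, rule in rules.items() if word not in allow}

TUNED_RULES = tune(UPSTREAM_RULES, GLOBAL_ALLOW)
print(sorted(TUNED_RULES))
```

Keeping the allow-list separate from the upstream rules means the upstream project can be updated without losing local decisions.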

In the few cases where we want to ban a word in general but still use it in a single document or paragraph, we use remark-message-control to make a narrow exception. For example, our geocoding documentation needs to use the word ‘component’ to refer to a URI Component, so one paragraph is preceded by the HTML comment:

<!-- simplify disable component -->

This tells retext-mapbox-standard to let us use the word component in the paragraph that follows, but keeps our general rule against the word intact.
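The effect of such a marker can be sketched in a few lines of Python. This is not how remark-message-control is implemented, and the real plugin's scoping may differ; the sketch only models the behavior described here, where the exception covers the paragraph that follows. The `BANNED` set, the `MARKER` pattern, and the `check` function are hypothetical.

```python
import re

# Hypothetical list of generally banned words.
BANNED = {"component"}

# Matches a comment such as: <!-- simplify disable component -->
MARKER = re.compile(r"<!--\s*simplify disable ([\w\s-]+?)\s*-->")

def check(document):
    """Flag banned words, honoring a marker for the next paragraph only."""
    messages = []
    allowed = set()
    # Treat blank-line-separated blocks as paragraphs.
    for block in re.split(r"\n\s*\n", document.strip()):
        marker = MARKER.fullmatch(block.strip())
        if marker:
            allowed = set(marker.group(1).split())
            continue
        for word in re.findall(r"[A-Za-z']+", block):
            key = word.lower()
            if key in BANNED and key not in allowed:
                messages.append(f'Avoid "{word}"')
        # The exception applies to one paragraph, then the rule returns.
        allowed = set()
    return messages

doc = """<!-- simplify disable component -->

Encode each URI component before sending it.

Every component of the style is optional."""

messages = check(doc)
print(messages)
```

Only the second use of the word is reported, because the marker covers the paragraph directly after it.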

Working with Markdown

Context is one cause of false positives in our test suite, but there’s a more systematic problem that needs to be tackled first: separating the English from the Markdown syntax around it. We write documentation, blog posts, chat messages, and nearly everything else in the Markdown language, which is a friendly middle ground between plain text and HTML. That link to Markdown looks like

[Markdown](https://daringfireball.net/projects/markdown/)

as Markdown code. And that highlighted example of Markdown is written as

```markdown
[Markdown](https://daringfireball.net/projects/markdown/)
```

with both Markdown and the Liquid template syntax we use with Jekyll.

When we write Markdown and Jekyll, we’re writing both for the computer, to define where a link goes or which words are bold, and for the reader, in English. Running our strict usage rules over the source without separating these two audiences would immediately fail: a link that leads to osm.org trips our rule that forbids the word “OSM” in English (because it’s a jargon acronym for OpenStreetMap), even though the reader never sees the link’s href attribute, only its displayed text. We’d also get errors in code samples that need to refer to geojson in lowercase rather than in its brand style, “GeoJSON.”

We use remark to solve this problem. Remark is a Markdown parser that doesn’t immediately generate HTML; instead, it produces JavaScript objects that represent the structure of the file, much like the Document Object Model represents the structure of HTML. All of our tests run against this parsed version of the document, called an abstract syntax tree, and ignore everything that readers won’t see. When the testing code encounters a link, it tests the text inside the link but ignores the text in its URL.
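The idea of testing only what the reader sees can be illustrated without a full parser. The Python sketch below is a deliberately crude stand-in: it uses regular expressions rather than an abstract syntax tree, it handles only inline links and code, and `visible_text` is a hypothetical helper rather than anything remark provides.

```python
import re

# Build the fence marker from single backticks so this example can itself
# sit inside a fenced block.
FENCE_MARK = "`" * 3
FENCE = re.compile(re.escape(FENCE_MARK) + r".*?" + re.escape(FENCE_MARK), re.DOTALL)
INLINE_CODE = re.compile(r"`[^`]*`")
LINK = re.compile(r"\[([^\]]*)\]\([^)]*\)")

def visible_text(markdown):
    """Return roughly the prose a reader sees: no code, no link URLs."""
    text = FENCE.sub("", markdown)      # drop fenced code samples
    text = INLINE_CODE.sub("", text)    # drop inline code
    return LINK.sub(r"\1", text)        # keep link text, drop the URL

sample = "Edit the map on [OpenStreetMap](https://osm.org) and save it as `geojson`."
result = visible_text(sample)
print(result)
```

A real parser handles nesting, reference-style links, and escaping that these patterns miss, which is why the tests run against the tree instead of the raw text.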

Using a parsed version of the document makes retext-mapbox-standard understand Markdown, but understanding English is another step. Simply searching for the words that we ban would incorrectly flag words that contain them. We ban ‘just’ but not ‘justification’, so ‘justification’ shouldn’t produce an error. We need to test for words, not just text.

To avoid that problem, we use retext - a tool that takes content and applies the rules of English, separating words and sentences into manipulable structures. These are even smaller pieces than those that remark creates, and they give us even more control.
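The difference between searching text and testing words is easy to show. This Python sketch uses a simple regular expression as a stand-in for real tokenization; it is not how retext splits text, and both function names are hypothetical.

```python
import re

def substring_hits(text, banned):
    """Naive search: flags any banned word found anywhere in the text."""
    lowered = text.lower()
    return [word for word in sorted(banned) if word in lowered]

def word_hits(text, banned):
    """Word-level search: flags only whole words that are banned."""
    words = re.findall(r"[a-z']+", text.lower())
    return [word for word in words if word in banned]

banned = {"just"}
sentence = "The justification for this default is historical."

print(substring_hits(sentence, banned))  # false positive on "justification"
print(word_hits(sentence, banned))       # no match: "just" is not a word here
```

The naive search reports ‘just’ inside ‘justification’, while the word-level search stays quiet until the banned word appears on its own.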

An ecosystem for language tools

retext-mapbox-standard is a mix of existing tools with a few extra rules for our peculiarities. These tools - remark, retext, and the nlcst standard - are ideal building blocks for similar tools, and all are open source projects from Titus Wormer. We took lots of inspiration from Katy DeCorah’s copy-cop project, and have used Joblint to keep our job postings sexism-free. Zipf’s law of common terms and Thing Explainer’s tiny vocabulary were the background music for this project.

We’re going to keep expanding the scope of retext-mapbox-standard to test more content: a particular dream of the Mapbox Studio team is to extract all of the text we use in the application interface and check it for style and spelling errors automatically.

Try out retext-mapbox-standard, use code to play with the English language, and, if you’re interested in these problems, work at Mapbox!