Monday, February 7, 2011

UTF-8 good, UTF-16 bad

Lately I’ve been programming in Go. I started by writing a few popular Unix tools in Go, before moving on to a lexer that supports structural regular expressions and UTF-8.

Unicode support was easy to add because Go lives and breathes UTF-8. But why did the Go designers favour UTF-8, and not UTF-16? Widespread software such as Windows, Java, and Python uses UTF-16 internally, so why not Go?
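
To see what "lives and breathes UTF-8" means in practice, here is a minimal sketch using only the standard library (the example string is arbitrary): Go source files are UTF-8, len counts bytes, and range decodes code points as it walks the string.

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s := "héllo" // Go source is UTF-8, so this literal is stored as UTF-8 bytes.

        // len reports bytes, not characters: 'é' occupies two bytes.
        fmt.Println(len(s), utf8.RuneCountInString(s)) // 6 5

        // range decodes UTF-8 as it iterates, yielding byte offsets and runes.
        for i, r := range s {
            fmt.Printf("%d: %c (U+%04X)\n", i, r, r)
        }
    }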

Further reading revealed the answer. UTF-16 is a historical accident that persists mainly due to inertia. UTF-16 has no practical advantages over UTF-8, and it is worse in some ways. The Go designers knew what they were doing, so they chose UTF-8.

Down with UTF-16

I had previously thought, from vague hunches, that UTF-16 was simpler because each character took exactly 2 bytes (accepting that it handled only a strict subset of Unicode), and that UTF-16 was more compact for certain languages.

The first statement is patently false: UTF-16 is a variable-length encoding, and hence not much simpler than UTF-8. As for the second: consider HTML. The whitespace and tags in a typical document are all ASCII, and they account for enough of the file that UTF-8 ends up more compact than UTF-16. This generalizes to other formats, such as Go source code.
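
As a quick sanity check rather than a benchmark, here is a Go sketch that encodes a mostly-ASCII snippet both ways, using the standard unicode/utf16 package; the markup dominates, so UTF-8 wins even with a few accented characters in the text.

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        // A small, mostly-ASCII snippet standing in for typical markup.
        page := "<p>naïve – but typical – HTML</p>"

        utf8Bytes := len(page)                            // Go strings hold UTF-8 bytes.
        utf16Bytes := 2 * len(utf16.Encode([]rune(page))) // two bytes per 16-bit unit.

        // The UTF-16 encoding comes out noticeably larger despite the non-ASCII characters.
        fmt.Printf("UTF-8: %d bytes, UTF-16: %d bytes\n", utf8Bytes, utf16Bytes)
    }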

So there’s no upside. But there are plenty of downsides to UTF-16:

  1. We must worry about byte ordering because UTF-16 deals with 16 bits at a time. A related drawback is the loss of self-synchronization at the byte level. (The sketch after this list illustrates points 1 to 3.)

  2. We lose backwards compatibility with ASCII.

  3. We lose backwards compatibility with code treating NUL as a string terminator.

  4. We cannot losslessly convert from byte streams.

  5. The encoding is inflexible: hopefully 0x110000 code points should be enough, but if it were extended (again), UTF-16 would need major surgery. In contrast, by simply allowing up to 6 bytes, UTF-8 can handle up to 2³¹ code points.

  6. Ugly encoding: surrogate pairs are the reason U+D800 to U+DFFF is reserved (though thankfully UTF-16 is self-synchronizing with 16-bit granularity).
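
To make points 1 to 3 concrete, here is a minimal Go sketch, using only the standard unicode/utf16 and encoding/binary packages: the UTF-16 bytes for a plain ASCII string depend on a byte-order choice and are riddled with NULs, while the UTF-8 bytes are simply the ASCII bytes.

    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "unicode/utf16"
    )

    func main() {
        s := "Go" // plain ASCII

        // UTF-8: the bytes are the ASCII bytes; no NULs, no byte-order question.
        fmt.Printf("UTF-8:     % X\n", []byte(s)) // 47 6F

        // UTF-16: the same text needs a byte-order decision and sprouts NUL bytes,
        // breaking ASCII compatibility and NUL-terminated string handling.
        units := utf16.Encode([]rune(s))

        var be, le bytes.Buffer
        binary.Write(&be, binary.BigEndian, units)
        binary.Write(&le, binary.LittleEndian, units)
        fmt.Printf("UTF-16 BE: % X\n", be.Bytes()) // 00 47 00 6F
        fmt.Printf("UTF-16 LE: % X\n", le.Bytes()) // 47 00 6F 00
    }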

I tired of thinking about the drawbacks of UTF-16, but there are likely more. Even if we only worked with an unusual set of documents for which UTF-16 happened to be more compact, the savings would be too small to justify the extra complexity.

A living dead standard

How did so many projects settle on such a horrible standard? It’s a sad story. The original Unicode supported exactly 0x10000 code points, making UCS-2 (the precursor to UTF-16) attractive. Each character neatly corresponded to one 16-bit word. Although UTF-8 might still be preferable, since it avoids byte-ordering issues and stays compatible with ASCII, UCS-2 is certifiably sane. People built systems based on UCS-2, and rejoiced that the codepage Stone Age was over.

Then an unnatural disaster struck: the Unicode Consortium decided more code points were needed. The supple UTF-8 got away unscathed, but there is no way to squeeze more than 2¹⁶ code points into UCS-2. Apart from abandoning UCS-2, the only recourse was to reserve a chunk of values for encoding the higher code points. Thus UTF-16 was born. The hope was that developers could simply tweak masses of existing UCS-2 code for the new standard. Sadly, various UTF-16 bugs suggest that much code remains unaware that UTF-16 is a variable-length encoding.
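
For illustration, a quick Go sketch of that reserved chunk in action: a code point beyond U+FFFF becomes a surrogate pair drawn from U+D800 to U+DFFF, and code that still counts 16-bit units, the old UCS-2 assumption, gets the character count wrong.

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        r := rune(0x1D11E) // 𝄞 MUSICAL SYMBOL G CLEF, beyond U+FFFF

        // UTF-16 must split it into a surrogate pair from the reserved range.
        hi, lo := utf16.EncodeRune(r)
        fmt.Printf("U+%X -> %X %X\n", r, hi, lo) // U+1D11E -> D834 DD1E

        // The classic UCS-2-era bug: counting 16-bit units instead of code points.
        s := "G clef: 𝄞"
        units := utf16.Encode([]rune(s))
        fmt.Println(len(units), len([]rune(s))) // 10 9: the surrogate pair inflates the count
    }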

Despite the trauma it would have caused, they should have killed UCS-2 when they had the chance. Now we have a grotesque encoding scheme that shambles alongside UTF-8.

Gauss was here

According to Wikipedia, UTF-8 was designed by Rob Pike and Ken Thompson, building on ASCII. ASCII was developed from telegraph codes by a committee. The telegraph codes stemmed from a Western Union code, which evolved from a code by Donald Murray, who modified a 5-bit code invented by Émile Baudot, who based his work on a code designed by Carl Friedrich Gauss and Wilhelm Weber. It’s awe-inspiring that the Prince of Mathematicians was involved. And if he were alive today, I bet he’d disparage UTF-16!