Why No Modern Programming Language Should Have a 'Character' Data Type

Andrew (he/him) - May 27 '20 - Dev Community

Photo by Henry & Co. from Pexels


Standards are useful. They quite literally allow us to communicate. If there were no standard grammar, no standard spelling, and no standard pronunciation, there would be no language. Two people expressing the same ideas would be unintelligible to one another. Similarly, without standard encodings for digital communication, there could be no internet, no world-wide web, and no DEV.to.

When digital communication was just beginning, competing encodings abounded. When all we can send along a wire are 1s and 0s, we need a way of encoding characters, numbers, and symbols within those 1s and 0s. Morse Code did this, Baudot codes did it in a different way, and FIELDATA in a third way. Dozens -- if not hundreds -- of other encodings came into existence between the middle of the 19th and the middle of the 20th centuries, each with its own method for grouping 1s and 0s and translating those groups into the characters and symbols relevant to its users.

Some of these encodings, like Baudot codes, used 5 bits (binary digits, 1s and 0s) to express up to 2^5 == 32 different characters. Others, like FIELDATA, used 6 or 7 bits. Eventually, the term byte came to represent this grouping of bits, and a byte reached the modern de facto standard of the 8-bit octet. Books could be written about this slow development over decades (and many surely have been), but for our purposes, this short history will suffice.

It was this baggage that the committee now known as ANSI (then called the American Standards Association, or ASA) had to manage while defining its new American Standard Code for Information Interchange (ASCII) encoding in 1963, as computing was quickly gaining importance for military, research, and even civilian use. The committee decided on a 7-bit, 128-character ASCII standard, to allow plenty of space for the 52 letters (upper- and lowercase) of the English alphabet, 10 digits, and many control codes and punctuation characters.

Even though ASCII was defined as a 7-bit encoding, the popularity of 8-bit bytes meant that ASCII characters commonly included a high 8th bit which went unused. In some applications, that bit acted as a toggle to make text italic.

In spite of this seeming embarrassment of riches with regard to defining symbols and control codes for English typists, there was one glaring omission: the remainder of the world's languages.

And so, as computing became more widespread, computer scientists in non-English-speaking countries needed their own standards. Some of them, like ISCII and VISCII, simply extended ASCII by putting the unused eighth bit to work, keeping the original 128 ASCII characters the same and adding up to 128 more. Logographic writing systems, like the one used to write Chinese, require thousands of individual characters. Defining a standard encompassing multiple logographic languages could require multiple additional bytes tacked onto ASCII.

Computer scientists realised early on that this would be a problem. On the one hand, it would be ideal to have a single, global standard encoding. On the other hand, if 7 bits worked fine for all English-language purposes, those additional 1, 2, or 3 bytes would simply be wasted space most of the time ("zeroed out"). When these standards were being created, disk space was at a premium, and spending three quarters of it on zeroes for a global encoding was out of the question. For a few decades, different parts of the world simply used different standards.
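To put a number on that waste, here is a rough sketch in Python (the language and the sample text are my own choices for illustration, not something from the standards discussions). It encodes the same English text at one byte per character and at a fixed four bytes per character, then counts how many of the wide bytes are pure padding.

```python
# Compare the storage cost of a 1-byte-per-character encoding with a
# fixed-width 4-byte encoding for plain English text.
text = "Hello, world"  # illustrative sample; any ASCII-only text behaves the same

ascii_bytes = text.encode("ascii")      # 1 byte per character
wide_bytes = text.encode("utf-32-le")   # 4 bytes per character, no byte-order mark

zero_bytes = wide_bytes.count(0)        # padding bytes that carry no information
wasted_fraction = zero_bytes / len(wide_bytes)

print(len(ascii_bytes))   # 12
print(len(wide_bytes))    # 48
print(wasted_fraction)    # 0.75
```

Three of every four bytes are zero, which is the "three quarters" figure above.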

But in the late 1980s, as the world was becoming more tightly connected and global internet usage expanded, the need for a global standard grew. What would become the Unicode Consortium began in 1987 as a collaboration between engineers at Xerox and Apple, defining a 2-byte (16-bit) standard character encoding as a "wide-body ASCII":

"Unicode aims in the first instance at the characters published in modern text... whose number is undoubtedly far below 2^14 = 16,384. Beyond those modern-use characters, all others may be defined to be obsolete or rare; these are better candidates for private-use registration than for congesting the public list of generally useful Unicodes."

And so Unicode fell into the same trap as ASCII in its early days: by over-narrowing its scope (focusing only on "modern-use characters") and prioritising disk space, Unicode's opinionated 16-bit standard -- declaring by fiat what would be "generally useful" -- was predestined for obsolescence.

This 2-byte encoding survives today as "UTF-16" and is still used in many applications. It's the string encoding in both JavaScript and Java, and it's used internally by Microsoft Windows. But even 16 bits' worth (65,536) of characters quickly filled up, and Unicode had to be expanded to include "generally useless" characters. As new characters were added, the encoding changed from a fixed-width one to a variable-width one: characters beyond the first 65,536 are stored as two 16-bit units, known as a surrogate pair.

Modern Unicode consists of over 140,000 individual characters, requiring at least 18 bits to represent (the full code space, which runs up to 0x10FFFF, needs 21). This, of course, creates a dilemma. Do we use a fixed-width 32-bit (4-byte) encoding? Or a variable-width encoding? With a variable-width encoding, how can we tell whether a sequence of 8 bytes is eight 1-byte characters or four 2-byte characters or two 4-byte characters or some combination of those?

UTF-8, the dominant modern variable-width encoding of Unicode, is actually a code-within-a-code. The leading bits of the first byte of a multi-byte character encode the number of bytes in that sequence.
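As a minimal sketch of that code-within-a-code, the Python below reads only the leading bits of each first byte to decide where one encoded character ends and the next begins. The function names utf8_sequence_length and split_utf8 are hypothetical helpers written for this illustration, and the sketch assumes well-formed input: it does not check that continuation bytes actually start with the bits 10.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence occupies, judging by its first byte."""
    if first_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if first_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx: 2-byte sequence
        return 2
    if first_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx: 3-byte sequence
        return 3
    if first_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx: 4-byte sequence
        return 4
    # 10xxxxxx is a continuation byte; anything else cannot start a sequence.
    raise ValueError(f"not a valid UTF-8 leading byte: {first_byte:#04x}")


def split_utf8(data: bytes) -> list:
    """Split well-formed UTF-8 bytes into one chunk per encoded code point."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


# "a", e-acute, euro sign, grinning-face emoji
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
lengths = [len(chunk) for chunk in split_utf8(sample)]
print(lengths)  # [1, 2, 3, 4]
```

This is how a decoder can tell a run of 8 bytes apart: the first byte of each sequence announces its own width, so there is never any ambiguity about the grouping.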

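This is a complex problem. Because of its UTF-16 encoding, JavaScript will break apart characters that require more than two bytes to encode. An emoji such as 😀 (U+1F600), for example, reports a length of 2 and splits into two meaningless halves when indexed: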
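As a rough illustration, the Python sketch below mimics what a UTF-16 string such as JavaScript's stores for a code point above U+FFFF. Python strings themselves are indexed by code point, so the 16-bit units are recovered by encoding explicitly; the helper name utf16_code_units and the sample emoji are assumptions made for this example.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units a UTF-16 string would store for text."""
    raw = text.encode("utf-16-le")               # little-endian, no byte-order mark
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


emoji = "\U0001F600"                             # grinning face, above U+FFFF

units = utf16_code_units(emoji)
print(len(emoji))                # 1: one code point
print(len(units))                # 2: the length a UTF-16 string reports
print([hex(u) for u in units])   # ['0xd83d', '0xde00']

# Each unit on its own falls in the surrogate range and is not a character.
all_surrogates = all(0xD800 <= u <= 0xDFFF for u in units)
print(all_surrogates)            # True
```

Indexing or slicing by code unit can therefore land between the two halves of the pair, leaving a lone surrogate that does not stand for any character.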

Clearly, these are "characters" in the lay sense, but not according to UTF-16 strings. The entire body of terminology around characters in programming languages has now become so overcomplicated that we have characters, code points, code units, glyphs, and graphemes, all of which mean slightly different things, except sometimes they don't.

Thanks to combining marks, a single grapheme -- the closest thing to a layperson's definition of a "character" -- can contain a virtually unlimited number of UTF-16 "characters" (code units). There are multi-thousand-line libraries dedicated solely to splitting text into graphemes. Any single emoji is a grapheme, but some consist of 7 or more individual UTF-16 "characters".
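To make the mismatch concrete, here is a small Python sketch that measures the same visible "character" in three different units. The helper name sizes and the sample strings are my own choices for illustration, and whether the family emoji is drawn as a single glyph depends on the font and platform.

```python
import unicodedata


def sizes(text: str) -> dict:
    """Measure one string in code points, UTF-16 code units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# "e" followed by COMBINING ACUTE ACCENT: displayed as a single accented letter
accented = "e\u0301"

# Family emoji: man, woman, girl, boy, glued together by zero-width joiners
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Combining marks can keep stacking on one base letter with no fixed limit
stacked = "e" + "\u0301" * 20

print(sizes(accented))   # {'code_points': 2, 'utf16_units': 2, 'utf8_bytes': 3}
print(sizes(family))     # {'code_points': 7, 'utf16_units': 11, 'utf8_bytes': 25}
print(len(stacked))      # 21 code points, still one base letter

# Everything after the base letter is a combining mark (non-zero combining class).
only_marks = all(unicodedata.combining(ch) != 0 for ch in stacked[1:])
print(only_marks)        # True
```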

In my opinion, the only sensibly-defined entities in character wrangling as of today are the following:

  • "byte" -- a group of 8 bits
  • "code point" -- this is just a number, contained within the Unicode range 0x000000 - 0x10FFFF, which is mapped to a Unicode element; as a raw number, a code point needs at most 21 bits, so between 1 and 3 bytes to represent
  • "grapheme" -- an element which takes up a single horizontal "unit" of space to display on a screen; a grapheme can consist of 1 or more code points

A code point encoded in UTF-32 is always four bytes wide and uniquely maps to a single Unicode element. A code point encoded in UTF-8 can be 1-4 bytes wide, and can compactly represent any one Unicode element. If there were no such thing as combining marks, either or both of those two standards should be enough for the foreseeable future. But the fact that combining marks can stack Unicode elements on top of each other in the same visual space blurs the definition of what a "character" really is.

You can't expect a user to know -- or care about -- the difference between a character and a grapheme.

So what are we really talking about when we define a character data type in a programming language? Is it a fixed-width integer type, like in Java? In that case, it can't possibly represent all possible graphemes and doesn't align with the layperson's understanding of "a character". If an emoji isn't a single character, what is it?

Or is a character a grapheme? In that case, the memory set aside for it can't really be bounded, because any number of combining marks could be added to it. In this sense, a grapheme is just a string with some unusual restrictions on it.

Why do you need a character type in your programming language anyway? If you want to loop over code points, just do that. If you want to check for the existence of a code point, you can also do that without inventing a character type. If you want the "length" of a string, you'd better define what you mean -- do you want the horizontal visual space it takes up (number of graphemes)? Or do you want the number of bytes it takes up in memory? Something else maybe?
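None of those operations needs a dedicated character type. The Python sketch below is one way to do each of them with plain strings and integers; the sample text and variable names are assumptions for illustration. The grapheme count in particular is a deliberate simplification: it only skips combining marks and is not the full Unicode text segmentation algorithm, so it miscounts joined emoji sequences, flags, and other multi-code-point clusters.

```python
import unicodedata

# "café" written with a combining accent, then a space and a grinning-face emoji
text = "cafe\u0301 \U0001F600"

# Loop over code points: iterating a Python str yields one code point at a time.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Check for the existence of a code point, no character type required.
has_combining_acute = "\u0301" in text

# "Length" depends on what is being measured.
length_in_code_points = len(text)
length_in_utf8_bytes = len(text.encode("utf-8"))

# Rough grapheme estimate: count code points that are not combining marks.
# A simplification for illustration only, NOT full grapheme segmentation.
approx_graphemes = sum(1 for ch in text if unicodedata.combining(ch) == 0)

print(code_points)
print(has_combining_acute)      # True
print(length_in_code_points)    # 7
print(length_in_utf8_bytes)     # 11
print(approx_graphemes)         # 6
```

Three different "lengths" for one short string is exactly why the question has to be asked precisely.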

Either way, the notion of a "character" in computer science has become so confused and so disconnected from the intuitive notion that I believe it should be abandoned entirely. Graphemes and code points are the only sensible way forward.
