Unicode in C and C++: What You Can Do About It Todayby Jeff BezansonIf you write an email in Russian and send it to somebody in Russia, it is depressingly unlikely that he or she will be able to read it. If you write software, the burden of this sad state of affairs rests on your shoulders.
Given modern hardware resources, it is unacceptable that we can't yet
routinely communicate text in different scripts or containing technical
symbols. Fortunately, we are getting there.
I. Encoding textGiven the variety of human languages on this planet, text is a complex subject. Many are scared away from dealing with world scripts, because they think of the numerous related software problems in the area instead of focusing on what they can actually do with their code to help.The first thing to know is that you do not have to worry about most problems with digital text. The most difficult work is handled below the application layer, in OSes, UI libraries, and the C library. To give you an idea of what goes on though, here is a summary of software problems surrounding text:
The encoding I'll talk about is called Unicode. Unicode officially encodes 1,114,112 characters, from 0x000000 to 0x10FFFF. (The idea that Unicode is a 16-bit encoding is completely wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. For convenience, the first 128 Unicode characters are the same as those in the familiar ASCII encoding. The consensus is that storing four bytes per character is wasteful, so a variety of representations have sprung up for Unicode characters. The most interesting one for C programmers is called UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it requires a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character. UTF-8 has the property that all existing 7-bit ASCII strings are still valid. UTF-8 only affects the meaning of bytes greater than 127, which it uses to represent higher Unicode characters. A character might require 1, 2, 3, or 4 bytes of storage depending on its value; more bytes are needed as values get larger. To store the full range of possible 32-bit characters, UTF-8 would require a whopping 6 bytes. But again, Unicode only defines characters up to 0x10FFFF, so this should never happen in practice. UTF-8 is a specific scheme for mapping a sequence of 1-4 bytes to a number from 0x000000 to 0x10FFFF: 00000000 -- 0000007F: 0xxxxxxx 00000080 -- 000007FF: 110xxxxx 10xxxxxx 00000800 -- 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx 00010000 -- 001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxxThe x's are bits to be extracted from the sequence and glued together to form the final number. It is fair to say that UTF-8 is taking over the world. It is already used for filenames in Linux and is supported by all mainstream web browsers. This is not surprising considering its many nice properties:
See What is UTF-8? for more information about UTF-8. Side note: Some consider UTF-8 to be discriminatory, since it allows English text to be stored efficiently at one byte per character while other world scripts require two bytes or more. This is a troublesome point, but it should not get in the way of Unicode adoption. First of all, UTF-8 was not really designed to preferentially encode English text. It was designed to preserve compatibility with the large body of existing code that scans for special characters such as line breaks, spaces, NUL terminators, and so on. Furthermore, the encoding used internally by a program has little impact on the user as long as it is able to represent their data without loss. UTF-8 is a great boon, especially for C programming. Think of it this way: if it allows you to internationalize an application that would have been difficult to convert otherwise, it is much less discriminatory than the alternative. II. The C libraryAll recent implementations of the standard C library have lots of functions for manipulating international strings. Before reading up on them, it helps to know some vocabulary:"Multibyte character" or "multibyte string" refers to text in one of the many (possibly language-specific) encodings that exist throughout the world. A multibyte character does not necessarily require more than one byte to store; the term is merely intended to be broad enough to encompass encodings where this is the case. UTF-8 is in fact only one such encoding; the actual encoding of user input is determined by the user's current locale setting (selected as an option in a system dialog or stored as an environment variable in UNIX). Strings you get from the user will be in this encoding, and strings you pass to printf() are supposed to be as well. Strings within your program can of course be in any encoding you want, but you might have to convert them for proper display. "Wide character" or "wide character string" refers to text where each character is the same size (usually a 32-bit integer) and simply represents a Unicode character value ("code point"). This format is a known common currency that allows you to get at character values if you want to. The wprintf() family is able to work with wide character format strings, and the "%ls" format specifier for normal printf() will print wide character strings (converting them to the correct locale-specific multibyte encoding on the way out). The C library also provides functions like towupper() that can convert a wide character from any language to uppercase (if applicable). strftime() can format a date and time string appropriately for the current locale, and strcoll() can do international sorting. These and other functions that depend on locale must be initialized at the beginning of your program using #include <locale.h> int main() { char *locale; locale = setlocale(LC_ALL, ""); ... }You don't have to do anything with the locale string returned by setlocale(), but you can use it to query your user's locale settings (more on this later). The C library pretty much assumes you will be using multibyte strings throughout your program (since that's what you get as input). Since multibyte strings are opaque, a lot of functions beginning with "mb" are provided to deal with them. Personally, I don't like not knowing what encoding my strings use. One concrete problem with the multibyte thing is file I/O— a given file could be in any encoding, independent of locale. When you write a file or send data over a network, keeping the multibyte encoding might be a bad idea. (Even if all software uses only the proper locale-independent C library functions, and all platforms support all encodings internally, there is still no single standard for communicating the encoding of a piece of text; email messages and HTML tags do it in various ways.) You also might be able to do more efficient processing, or avoid rewriting code, if you knew the encoding your strings used. Your encoding options You are free to choose a string encoding for internal use in your program. The choice pretty much boils down to either UTF-8, wide (4-byte) characters, or multibyte. Each has its advantages and disadvantages:
In this article I will advocate and give explicit instruction on using UTF-8 as an internal string encoding. Many Linux users already set their environment to a UTF-8 locale, in which case you won't even have to do any conversions. Otherwise you will have to convert multibyte to wide to UTF-8 on input, and back to multibyte on output. Nevertheless, UTF-8 has its advantages. III. What to do right nowBelow I'll outline concrete steps any C programmer could take to bring his or her code up to date with respect to text encoding. I'll also be presenting a simple C library that provides the routines you need to manipulate UTF-8.Here's your to-do list:
IV. Some UTF-8 routinesVarious libraries are available for internationalization and converting between different text encodings. However, I couldn't find a straightforward set of C routines providing the minimal support needed for using UTF-8 as an internal encoding (although this functionality is often embedded in large UI toolkits and such). I decided to create a small library that could be used to bring UTF-8 to arbitrary C programs.This library is quite incomplete; you might want to look at related FSF offerings and libutf8. libutf8 provides the multibyte and wide character C library routines mentioned above, in case your C library doesn't have them. Since performance is sometimes a concern with UTF-8, I made my routines as fast and lightweight as possible. They perform minimal error checking— in particular, they do not bother to determine whether a sequence is valid UTF-8, which can actually be a security problem. I justify this decision by reiterating that the intention of the library is to manipulate an internal encoding; you can enforce that all strings you store in memory be valid UTF-8, enabling the library to make that assumption. Routines for validating and converting from/to UTF-8 are available free from Unicode, Inc. Note that my routines do not need to support the many encodings of the world—the C library can handle that. If the current locale is not UTF-8, you call mbstowcs() on user input to convert any encoding (whatever it is) to a wide character string, then use my u8_toutf8() to convert it to the UTF-8 your program is comfortable with. Here's an example input routine wrapping readline(): char *get_utf8_input() { char *line, *u8s; unsigned int *wcs; int len; line = readline(""); if (locale_is_utf8) { return line; } else { len = mbstowcs(NULL, line, 0)+1; wcs = malloc(len * sizeof(int)); mbstowcs(wcs, line, len); u8s = malloc(len * sizeof(int)); u8_toutf8(u8s, len*sizeof(int), wcs, len); free(line); free(wcs); return u8s; }The first call to mbstowcs() uses the special parameter value NULL to find the number of characters in the opaque multibyte string. Anyway, on with the routines. They are divided into four groups: Group 1: conversions /* is c the start of a utf8 sequence? */ #define isutf(c) (((c)&0xC0)!=0x80) /* convert UTF-8 to UCS-4 (4-byte wide characters) srcsz = source size in bytes, or -1 if 0-terminated sz = dest size in # of wide characters returns # characters converted */ int u8_toucs(unsigned int *dest, int sz, char *src, int srcsz); /* convert UCS-4 to UTF-8 srcsz = number of source characters, or -1 if 0-terminated sz = size of dest buffer in bytes returns # characters converted */ int u8_toutf8(char *dest, int sz, unsigned int *src, int srcsz); /* single character to UTF-8 */ int u8_wc_toutf8(char *dest, wchar_t ch);Note that the library uses "unsigned int" as its wide character type. You can convert a known number of bytes, or a NUL-terminated string. The length of a UTF-8 string is often communicated as a byte count, since that's what really matters. Recall that you can usually treat a UTF-8 string like a normal C-string with N characters (where N is the number of bytes in the UTF-8 sequence), with the possibility that some characters are >127. Group 2: moving through UTF-8 strings /* character number to byte offset */ int u8_offset(char *str, int charnum); /* byte offset to character number */ int u8_charnum(char *s, int offset); /* return next character, updating a byte-index variable */ unsigned int u8_nextchar(char *s, int *i); /* move to next character */ void u8_inc(char *s, int *i); /* move to previous character */ void u8_dec(char *s, int *i);
Group 3: Unicode escape sequences /* assuming src points to the character after a backslash, read an escape sequence, storing the result in dest and returning the number of input characters processed */ int u8_read_escape_sequence(char *src, unsigned int *dest); /* given a wide character, convert it to an ASCII escape sequence stored in buf, where buf is "sz" bytes. returns the number of characters output. */ int u8_escape_wchar(char *buf, int sz, unsigned int ch); /* convert a string "src" containing escape sequences to UTF-8 */ int u8_unescape(char *buf, int sz, char *src); /* convert UTF-8 "src" to ASCII with escape sequences. if escape_quotes is nonzero, quote characters will be preceded by backslashes as well. */ int u8_escape(char *buf, int sz, char *src, int escape_quotes); /* utility predicates used by the above */ int octal_digit(char c); int hex_digit(char c); Group 4: replacements for standard functions /* return a pointer to the first occurrence of ch in s, or NULL if not found. character index of found character returned in *charn. */ char *u8_strchr(char *s, unsigned int ch, int *charn); /* same as the above, but searches a buffer of a given size instead of a NUL-terminated string. */ char *u8_memchr(char *s, unsigned int ch, size_t sz, int *charn); /* count the number of characters in a UTF-8 string */ int u8_strlen(char *s); /* given the string returned by setlocale(), determine whether the current locale speaks UTF-8 */ int u8_is_locale_utf8(char *locale); /* these functions can print from UTF-8 strings. they make no assumptions about locale; you can circumvent them if is_locale_utf8 */ int u8_vprintf(char *fmt, va_list ap); int u8_printf(char *fmt, ...);utf8.c utf8.h |