CLiki: Unicode and Lisp

Random ramblings on Unicode. Most Lisp implementations support Unicode; see Unicode Support.

In Common Lisp we have the nice 'feature' that characters are divided into sets. This is a legacy of the Symbolics Lisp Machines. In short, CHARACTER is the set of all characters an implementation can handle, and BASE-CHAR is an implementation-defined subset of CHARACTER that must contain the STANDARD-CHAR characters. Some implementations (SBCL with Unicode support, for example) take 7-bit ASCII as BASE-CHAR. Implementations can define other subsets, so for instance they could have UNICODE-CHAR as another subset of CHARACTER.

Now this doesn't really work well with Unicode. The Unicode people seem to have the following idea: the implementation internally uses whatever it likes (of course they prefer that you use the UCS-2 encoding [do they? "Unicode" now includes code points beyond #xFFFF, going up to #x10FFFF; UCS-2 is only good for the so-called Basic Multilingual Plane, which is only a subset of Unicode. And ISO 10646 was originally specified as a full 31 bits wide, requiring UCS-4]) and should support a whole host of external formats. CLISP seems to follow this procedure best in my (Peter Van Eynde's) opinion. UTF-8 is the recommended external format.
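To make the bracketed objection concrete, here is a small sketch in Python (used only because it is easy to run; it is not tied to any Lisp implementation). The helper names `encoded_sizes` and `fits_in_ucs2` are made up for this illustration. It compares one character inside the Basic Multilingual Plane with one outside it across three external formats.

```python
# Compare how one BMP character and one supplementary-plane character
# are represented in three external formats.
samples = {
    "U+00E9 (BMP)": "\u00e9",              # LATIN SMALL LETTER E WITH ACUTE
    "U+1D11E (beyond BMP)": "\U0001D11E",  # MUSICAL SYMBOL G CLEF
}

def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),  # -be: no byte-order mark
        "utf-32": len(ch.encode("utf-32-be")),
    }

def fits_in_ucs2(ch):
    """UCS-2 is a fixed 16-bit encoding, so it only covers U+0000..U+FFFF."""
    return ord(ch) <= 0xFFFF

for label, ch in samples.items():
    print(label, encoded_sizes(ch), "fits in UCS-2:", fits_in_ucs2(ch))
```

The character beyond the BMP needs four bytes in UTF-16 (a surrogate pair), which is exactly what a fixed 16-bit UCS-2 representation cannot express.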

Why is UTF-8 not good as an internal representation? Mainly because characters take a variable number of bytes, so access to the n-th character of a string is not constant-time; because it effectively makes strings immutable (replacing a character can change its byte length, so you may need to copy the string for every modification); and because there is the extra load of normalising all characters (a character can have an infinite number of UTF-8 representations [No it can't; it can only have one, since overlong forms are not valid UTF-8, unless you mean that, e.g., an accented character can be either a single precomposed character or a non-accented character followed by a combining accent, in which case that's true of any encoding, including UCS-2/4]).
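The points in this paragraph can be checked with a short Python sketch. It is an illustration only, not how any Lisp implementation stores strings, and `char_offset` and `is_valid_utf8` are hypothetical helper names. It shows that finding the n-th character in UTF-8 bytes requires a scan, that replacing a character can change the byte length, that an overlong form is rejected by a conforming decoder, and that precomposed versus combining sequences are a normalisation matter rather than an encoding one.

```python
import unicodedata

def char_offset(data: bytes, index: int) -> int:
    """Return the byte offset of the index-th character in UTF-8 data.

    Characters occupy 1 to 4 bytes, so the offset cannot be computed
    directly; we scan and skip continuation bytes (0b10xxxxxx).
    """
    seen = -1
    for offset, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # lead byte: a new character starts here
            seen += 1
            if seen == index:
                return offset
    raise IndexError(index)

def is_valid_utf8(data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "na\u00efve \u20ac5"          # 8 characters: n a ï v e space € 5
data = text.encode("utf-8")          # 11 bytes: ï takes 2 bytes, € takes 3

print(len(text), len(data))
print(char_offset(data, 3))          # byte offset of "v", found by scanning

# Replacing ï (2 bytes) with i (1 byte) keeps the character count but
# changes the byte count, so a byte buffer could not be updated in place.
shorter = text.replace("\u00ef", "i").encode("utf-8")
print(len(shorter))

# The overlong two-byte form of "/" is rejected; only b"/" is valid.
print(is_valid_utf8(b"/"), is_valid_utf8(b"\xc0\xaf"))

# Normalisation is a separate question from the encoding.
composed, decomposed = "\u00e9", "e\u0301"
print(composed == decomposed)
print(unicodedata.normalize("NFC", decomposed) == composed)
```

A fixed-width internal representation avoids the scan and the length change, but it does nothing about the last point: the two spellings of é stay distinct until they are normalised.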

In general I like CLISP's solution a lot.

Common Lisp extensions might be able to adopt some techniques from Shiro Kawai's Gauche implementation of Scheme, which has excellent multibyte string support. I realize that may not be the same as Unicode support, but Gauche does support UTF-8. Gauche has:

Multibyte string support: Strings are represented internally as multibyte strings. You can choose UTF-8, EUC-JP, Shift-JIS or no multibyte encoding at configure time. Conversion between the native coding system and external coding systems is supported by port objects.
Multibyte regexp: The regular expression matcher is aware of multibyte strings; you can use multibyte characters both in patterns and in matched strings.
BSD license.
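As a rough analogy for the port-based conversion and the multibyte-aware regexp described above, here is a Python sketch. It does not use Gauche's API; Python's `io.TextIOWrapper` merely stands in for a port that converts between an external coding system and the internal string representation, and the sample text is invented for the illustration.

```python
import io
import re

# Bytes as they might arrive from an external source, here in EUC-JP.
external = "日本語のテキスト".encode("euc_jp")

# The wrapper plays the role of an input port: it decodes on read.
reader = io.TextIOWrapper(io.BytesIO(external), encoding="euc_jp")
text = reader.read()

# Writing through another wrapper re-encodes to a different external format.
sink = io.BytesIO()
writer = io.TextIOWrapper(sink, encoding="utf-8")
writer.write(text)
writer.flush()
utf8_bytes = sink.getvalue()

# A character-aware regexp: "." matches one character, not one byte.
match = re.search(r"日.語", text)

print(len(text), len(external), len(utf8_bytes))
print(match.group(0))
```

The program only ever handles characters; the byte-level differences between EUC-JP and UTF-8 stay at the boundary, which is the division of labour the feature list describes.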
