CLiki: CloserLookAtCharacters

Characters are not their codes -- Encodings and external format in Common Lisp

By default, Lisp files are text files. A text file is a file containing characters. That is, things like aleph, three, e with acute accent, exclamation point, etc.

Lisp objects are typed, so there's no problem in Lisp with manipulating character objects. But in a Unix (POSIX) file we can only store numbers between 0 and 255. Unix (POSIX) has no notion of a text file. So we need to encode the characters into a coded byte stream.

There are several different coding systems for characters. The first step in defining a coding system is to specify the set of characters we want to encode. The second step is to define a bijective mapping between this set of characters and a subset of the integers. A third step may be to define a way to encode integers bigger than 255 as a sequence of integers between 0 and 255. Optionally, an additional step may assign some meaning to the codes that are still unused, for purposes such as controlling output devices or structuring files (defining records, blocks, etc).

For example, ASCII defines the following set of characters:

SPC  !   "   #   $   %   &   '   (   )   *   +   ,   -   .   /  
 0   1   2   3   4   5   6   7   8   9   :   ;   <   =   >   ?  
 @   A   B   C   D   E   F   G   H   I   J   K   L   M   N   O  
 P   Q   R   S   T   U   V   W   X   Y   Z   [   \   ]   ^   _  
 `   a   b   c   d   e   f   g   h   i   j   k   l   m   n   o  
 p   q   r   s   t   u   v   w   x   y   z   {   |   }   ~

ASCII defines the following mapping between characters and numbers:

 32 SPC  33  !   34  "   35  #   36  $   37  %   38  &   39  '  
 40  (   41  )   42  *   43  +   44  ,   45  -   46  .   47  /  
 48  0   49  1   50  2   51  3   52  4   53  5   54  6   55  7  
 56  8   57  9   58  :   59  ;   60  <   61  =   62  >   63  ?  
 64  @   65  A   66  B   67  C   68  D   69  E   70  F   71  G  
 72  H   73  I   74  J   75  K   76  L   77  M   78  N   79  O  
 80  P   81  Q   82  R   83  S   84  T   85  U   86  V   87  W  
 88  X   89  Y   90  Z   91  [   92  \   93  ]   94  ^   95  _  
 96  `   97  a   98  b   99  c  100  d  101  e  102  f  103  g  
104  h  105  i  106  j  107  k  108  l  109  m  110  n  111  o  
112  p  113  q  114  r  115  s  116  t  117  u  118  v  119  w  
120  x  121  y  122  z  123  {  124  |  125  }  126  ~

Since all numbers are between 0 and 255, no further processing is necessary to get a unique byte sequence for each character.
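As a minimal sketch of this mapping, the following portable code spells out the table above explicitly instead of relying on CHAR-CODE, whose values the standard leaves to the implementation. The names *ASCII-GRAPHIC-CHARACTERS*, ASCII-ENCODE-CHAR and ASCII-DECODE-CODE are made up for this illustration.

```lisp
;; The 95 ASCII graphic characters in code order, starting at code 32 (SPC).
(defparameter *ascii-graphic-characters*
  " !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~")

(defun ascii-encode-char (character)
  "Return the ASCII code of CHARACTER, or signal an error if it is not in the set."
  (let ((index (position character *ascii-graphic-characters*)))
    (if index
        (+ 32 index)
        (error "~S is not an ASCII graphic character." character))))

(defun ascii-decode-code (code)
  "Return the character that ASCII maps to CODE, or signal an error."
  (if (<= 32 code 126)
      (char *ascii-graphic-characters* (- code 32))
      (error "~D is not the code of an ASCII graphic character." code)))
```

With these definitions, (ascii-encode-char #\A) returns 65 and (ascii-decode-code 126) returns #\~. The character and its code are distinct objects; the table is what relates them.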

Finally, since there are unused numbers, ASCII defines some control codes:

  0 NUL    1 SOH    2 STX    3 ETX    4 EOT    5 ENQ    6 ACK    7 BEL  
  8 BS     9 TAB   10 LF    11 VT    12 FF    13 CR    14 SO    15 SI   
 16 DLE   17 DC1   18 DC2   19 DC3   20 DC4   21 NAK   22 SYN   23 ETB  
 24 CAN   25 EM    26 SUB   27 ESC   28 FS    29 GS    30 RS    31 US   
 32 SPC   33  !    34  "    35  #    36  $    37  %    38  &    39  '  
 40  (    41  )    42  *    43  +    44  ,    45  -    46  .    47  /  
 48  0    49  1    50  2    51  3    52  4    53  5    54  6    55  7  
 56  8    57  9    58  :    59  ;    60  <    61  =    62  >    63  ?  
 64  @    65  A    66  B    67  C    68  D    69  E    70  F    71  G  
 72  H    73  I    74  J    75  K    76  L    77  M    78  N    79  O  
 80  P    81  Q    82  R    83  S    84  T    85  U    86  V    87  W  
 88  X    89  Y    90  Z    91  [    92  \    93  ]    94  ^    95  _  
 96  `    97  a    98  b    99  c   100  d   101  e   102  f   103  g  
104  h   105  i   106  j   107  k   108  l   109  m   110  n   111  o  
112  p   113  q   114  r   115  s   116  t   117  u   118  v   119  w  
120  x   121  y   122  z   123  {   124  |   125  }   126  ~   127 DEL

Note however that ASCII doesn't define anything for byte values between 128 and 255.

Some coding systems, such as ISO-8859-1 or the various Unicode coding systems, are defined as extensions of ASCII. A few older ones, such as EBCDIC, are totally different, and a few, such as the national variants of ISO 646, are modifications of ASCII.

The various ISO-8859 coding systems add additional characters to the ASCII character set, and map them to codes between 160 and 255. Additional control codes are defined for codes between 128 and 159.

Unicode defines a much bigger character set and maps it to integers between 0 and 1114111. Then several ways to map these integers to sequences of bytes are defined, such as UTF-7, UTF-8, UTF-16LE, UTF-16BE, etc. One interesting property of UTF-8 is ASCII compatibility, i.e. in both systems, characters which belong to the ASCII set are encoded with the same byte sequence. But this may be misleading when you try to read UTF-8 files using the ASCII encoding: it may work most of the time, but will fail as soon as you encounter a character which is not in the ASCII set.
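The "third step" for UTF-8 can be sketched in portable code that works on integer code points only, so it does not depend on how any implementation represents characters. The function name UTF-8-ENCODE-CODE-POINT is an invention for this illustration, and the sketch does not reject the surrogate range #xD800-#xDFFF, which a strict encoder would refuse.

```lisp
(defun utf-8-encode-code-point (code)
  "Return the list of bytes that encodes the Unicode code point CODE in UTF-8."
  (check-type code (integer 0 #x10FFFF))
  (cond ((< code #x80)                  ; ASCII range: one byte, same value
         (list code))
        ((< code #x800)                 ; two bytes: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ldb (byte 5 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x10000)               ; three bytes: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ldb (byte 4 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        (t                              ; four bytes: 11110xxx and three 10xxxxxx
         (list (logior #xF0 (ldb (byte 3 18) code))
               (logior #x80 (ldb (byte 6 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))))
```

For instance, (utf-8-encode-code-point 65) returns (65), the same single byte as in ASCII, while (utf-8-encode-code-point 233) (e with acute accent) returns (195 169): two bytes, both above 127, which an ASCII decoder cannot interpret.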

Now given that the Common Lisp standard character set (the 95 graphic ASCII characters plus Newline) fits entirely within what ASCII can encode (how lucky!), and that ASCII is used by Unix (POSIX), it is quite possible that the default EXTERNAL-FORMAT used to read and write text files in Lisp is ASCII.

If the file contains a byte between 128 and 255, there's no corresponding character in the ASCII set, so reading it will signal an error. If you try to write a character outside of the ASCII set, then there will be no way to encode it and an error will be signaled. If you try to read a byte between 0 and 31 (excluding 13 or 10, one or both of which may control the line termination), it may very well signal an error.

Some implementations may use other encodings and external-formats by default or by configuration, and may map control codes to pseudo-characters in order to avoid signaling errors. But this is implementation-specific behavior. Some implementations extend the ISO-8859-1 coding system to map the control codes to pseudo-characters so there's a 1-1 correspondence between [0,255] and these characters.

Using this external-format on these implementations, you may be able to "safely" copy binary files. But this is only valid for some very specific implementations in very specific circumstances.

In conclusion, if you want to copy a file of bytes, you should use :ELEMENT-TYPE '(UNSIGNED-BYTE 8) (assuming it does what you want on your implementation/OS). If you want to copy a text file with a specific encoding, you must give the implementation-specific :EXTERNAL-FORMAT. Code that relies on the default element type and external format is only guaranteed to work on "well formed" "text" files, i.e. text files produced by the same Lisp implementation on the same system.
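A byte-for-byte copy along these lines might look like the following sketch. It uses only standard operators; the function name COPY-FILE-BYTES and the BUFFER-SIZE parameter are choices made for this example, and it assumes that the implementation stores (UNSIGNED-BYTE 8) elements as single octets in the file, which is the usual case on Unix.

```lisp
(defun copy-file-bytes (source destination &key (buffer-size 4096))
  "Copy SOURCE to DESTINATION byte by byte, without decoding any characters."
  (with-open-file (in source :element-type '(unsigned-byte 8))
    (with-open-file (out destination
                         :direction :output
                         :element-type '(unsigned-byte 8)
                         :if-exists :supersede)
      (let ((buffer (make-array buffer-size :element-type '(unsigned-byte 8))))
        ;; READ-SEQUENCE returns the index of the first element it did not
        ;; fill, so END is the number of bytes read in this round.
        (loop for end = (read-sequence buffer in)
              while (plusp end)
              do (write-sequence buffer out :end end))))))
```

Since no external format is involved, neither character encodings nor line-ending conventions can be altered by the copy.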

What about Unix file names?

Unix systems don't use characters to name files. They use sequences of bytes that contain neither 0 nor 47 (the codes ASCII assigns to NUL and /). The byte 46 and the sequence (46 46), which ASCII renders as . and .., are reserved for the names of the current directory and the parent directory, respectively.
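These rules can be written down directly on byte sequences, with no characters involved. The sketch below is only an illustration: the function names are invented here, the rule that a name must not be empty is added by this example, and real systems impose further limits (such as a maximum length) that it ignores.

```lisp
(defun unix-name-bytes-p (bytes)
  "True if BYTES, a sequence of integers, can name a Unix directory entry."
  (and (plusp (length bytes))           ; assumption of this sketch: no empty names
       (notany (lambda (byte) (member byte '(0 47))) bytes)))

(defun reserved-unix-name-p (bytes)
  "True if BYTES is (46) or (46 46), the current or parent directory."
  (and (<= 1 (length bytes) 2)
       (every (lambda (byte) (= byte 46)) bytes)))
```

For example, (unix-name-bytes-p #(102 111 111)) is true, (unix-name-bytes-p #(97 47 98)) is false because of the byte 47, and (reserved-unix-name-p #(46 46)) is true. A name such as #(195 169) is perfectly valid whether or not anyone reads it as UTF-8.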

On 2007-01-29 19:45, on irc://irc.freenode.org/#scheme (channel now moved to Libera Chat), "Riastradh" wrote the following about unix pathnames, bytes and characters:

"Please, let's [be pedantic], because this is too important an issue to handle lightly and screw up.

On Unix, a pathname is an array of bytes, where a byte is usually an octet. (It could be a septet, theoretically, but I don't think there are any modern Unix systems where this is the case.) [Or a ninet.]

The byte #x2F has special meaning in the pathname. Note that this is the interpretation of the actual bits involved; it is not the character `/' that means anything to Unix, but the byte #x2F. Here, by `the character ``/'' ', I mean the symbol that you are shown when an ASCII text renderer displays the byte #x2F, and the graphical and semantic implications thereof.

(In Unicode, `character' is *not* a well-defined term; its definition is explicitly avoided because there are so many different possible concepts competing for the name, so none was given it. There are abstract characters, encoded characters, code points, scalar values, glyphs, default grapheme clusters, and more; this is why I explain what I mean by `the character ``/'' '.)

(Also, by `byte' I mean the smallest addressable unit of memory -- i.e. what C very misleadingly calls `char' --, and by `octet' I mean an array of eight bits.)

Now, on Unix, the various system calls and library routines that work with pathnames represent them as arrays of bytes, because that is what a pathname is on Unix. They don't know anything about text codecs, internationalization, user localization, and so on; they deal with arrays of bytes.

When writing programs that work with pathnames and perhaps present them to users, however, we usually want two sorts of higher-level abstraction: one, a structure identifiable to the programming language, partly for convenience, partly for clarity, and partly for the sake of portability; and decoding and encoding pathnames for users.

The user has some notion of `text', which must be stored physically on the file system somehow; the file system knows about bytes, which must be presented to the user in an understandable format somehow. To accommodate this, Unix has a number of environment variables (which are, incidentally, also named by byte arrays) by which the user can identify preferences in the means of translation between machine and human understanding.

For example, the user might often deal with Hebrew and not much else, and may choose to use ISO-8859-7 [sic: ISO-8859-7 is the Greek part of ISO 8859; the Hebrew part is ISO-8859-8] to store pathnames. Or another user might be uninterested in Western isolationism, and instead prefer culturally insulting approximation of Eastern languages by choosing to store pathnames in UTF-8, allowing the full range of Unicode text in his pathnames.

The operating system cares nothing about the user's sociocultural background, however, and deals only in bytes.

These two users may need to communicate at some point, and they would like their programs to present pathnames without loss of information. But on the other hand, they probably also want a consistent file system without two distinct file names (that is, byte arrays) for what they each see as the same thing -- a particular name composed of Hebrew and Latin text, say. (Unfortunately, there *must* be some information lost in the translation from Unicode to ISO-8859-7, but never mind that for now.) (By the way, I apologize if I have offended anyone by my choice of cultural derogation -- my intent was to poke fun at everyone equally.)

When writing portable programs that deal with pathnames, furthermore, we often want two different sorts of pathnames: those handed to us by the operating system or otherwise tied to the byte array model, and those that we wrote ourselves or read from user input with a higher-level notion of what text the user intends.

Now, it is not entirely clear what choices are best for the mapping between the two sorts of pathnames I described -- `text pathnames' and `byte pathnames', say.

In Scheme48 1.5, the decision is made for the programmer: whenever he constructs a pathname (or `os-string') from a byte vector, its textual interpretation is fixed by the locale when the Scheme48 image was started; whenever he constructs a pathname (or `os-string') from a textual description, its encoding in bytes is fixed by the locale at the time the Scheme48 image was started."

Implementation Specific Support for Encodings

ACL

ACL has extensive support for different text coding systems. For details see International Character Support in Allegro CL.

CLISP

CLISP is usually compiled with libiconv and Unicode support. It supports all the coding systems provided by libiconv, exported as variables from the CHARSET package. The EXTERNAL-FORMAT argument can take an EXT:ENCODING object, which is composed of a character set (as exported from CHARSET) and a line terminator mode, which indicates which control codes are used to encode the #\Newline pseudo-character.

There are command-line arguments and variables in the CUSTOM package to set various default encodings (pathnames, file contents, terminal, foreign function calls, miscellaneous).

Two functions are provided to convert between strings of characters and vectors of bytes: EXT:CONVERT-STRING-FROM-BYTES and EXT:CONVERT-STRING-TO-BYTES.
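A small round trip with these two functions could look like the sketch below. It assumes a CLISP built with Unicode support and uses the CHARSET:UTF-8 encoding; the form is guarded with #+clisp so that other implementations skip it. The character is built with CODE-CHAR so that the example does not depend on the encoding of the source file itself.

```lisp
#+clisp
(let* ((text (string (code-char 233)))  ; e with acute accent
       (bytes (ext:convert-string-to-bytes text charset:utf-8))
       (back (ext:convert-string-from-bytes bytes charset:utf-8)))
  ;; Print the bytes and whether decoding gave back the original string.
  (format t "~S ~S~%" bytes (string= text back)))
```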

See the CLISP implementation notes, 31.5. Encodings.

CMUCL

CMUCL apparently supports utf-8 I/O through its simple-streams implementation, as well as iso-8859-1; its regular file streams support only iso-8859-1.

CMUCL release 20a supports Unicode strings, many more external formats on all streams, and mapping between Unicode pathnames and OS byte-sequences.

The following external formats are recognized:

 :ASCII :CP1250 :CP1251 :CP1252 :CP1253 :CP1254 :CP1255 :CP1256 :CP1257
 :CP1258 :ISO8859-1 :ISO8859-10 :ISO8859-13 :ISO8859-14 :ISO8859-15
 :ISO8859-2 :ISO8859-3 :ISO8859-4 :ISO8859-5 :ISO8859-6 :ISO8859-7
 :ISO8859-8 :ISO8859-9 :KOI8-R :MAC-CYRILLIC :MAC-GREEK :MAC-ICELANDIC
 :MAC-LATIN2 :MAC-ROMAN :MAC-TURKISH :UTF-16-BE :UTF-16-LE :UTF-16
 :UTF-32-BE :UTF-32-LE :UTF-32 :UTF-8

along with various aliases (:LATIN-1 is an alias for :ISO8859-1).

There are also composing external formats (external formats that must be combined with another external format):

 :BETA-GK :FINAL-SIGMA :CR :CRLF

Thus, an external format of '(:utf-8 :crlf) means that UTF-8 is the main encoding, that each CR/LF sequence is converted to a linefeed on input, and that each linefeed is written as a CR/LF sequence on output.

See the Internationalization chapter from the CMU User's Manual.

Clozure CL

CCL (formerly OpenMCL) supports Unicode with 32-bit characters and strings. Supported character encodings are:

(:UTF-16BE :ISO-8859-11 :ISO-8859-10 :US-ASCII :ISO-8859-2 :ISO-8859-15 :ISO-8859-9 :IBM866 :ISO-8859-7 :ISO-8859-8 :EUC-JP :UTF-16 :UTF-32 :CP936 :ISO-8859-1 :ISO-8859-14 :ISO-8859-16 :WINDOWS-31J :GB2312 :UCS-2BE :ISO-8859-13 :UCS-2LE :UTF-32LE :ISO-8859-4 :UTF-16LE :ISO-8859-3 :UCS-2 :ISO-8859-6 :UTF-32BE :ISO-8859-5 :MACINTOSH :UTF-8)

An external format can specify a character encoding and/or line termination keyword, as well as the domain where it's used.

SBCL

SBCL supports a limited number of coding systems. Because the interface has not been finalized yet, they are undocumented, but as of the 0.9.15 release the following are recognized:

(:ASCII :CP1250 :CP1251 :CP1252 :CP1253 :CP1254 :CP1255 :CP1256 :CP1257 :CP1258
 :CP437 :CP850 :CP852 :CP855 :CP857 :CP860 :CP861 :CP862 :CP863 :CP864 :CP865
 :CP866 :CP869 :CP874 :EBCDIC-US :EUC-JP :ISO-8859-1 :ISO-8859-10 :ISO-8859-11
 :ISO-8859-13 :ISO-8859-14 :ISO-8859-15 :ISO-8859-2 :ISO-8859-3 :ISO-8859-4
 :ISO-8859-5 :ISO-8859-6 :ISO-8859-7 :ISO-8859-8 :ISO-8859-9 :KOI8-R :KOI8-U
 :UCS-2BE :UCS-2LE :UTF-8 :X-MAC-CYRILLIC)

along with various aliases (so for instance :LATIN-9 is an alias for :ISO-8859-15). At present, there is no support for automatic translation of line-ending or byte-order-mark conventions.

Additionally, for in-memory conversion, SBCL provides the following functions:

SB-EXT:STRING-TO-OCTETS string &key external-format --> octets
SB-EXT:OCTETS-TO-STRING octets &key external-format --> string

string -- a string.
octets -- a vector of (unsigned-byte 8).
external-format -- a keyword designating an external format (see above)
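As a brief sketch of how these two functions might be used, the form below encodes the same one-character string under two external formats and decodes it again. It is guarded with #+sbcl so that other implementations skip it, and it assumes the :UTF-8 and :LATIN-1 designators are accepted by the SBCL at hand.

```lisp
#+sbcl
(let* ((text (string (code-char 233)))  ; e with acute accent
       (utf-8 (sb-ext:string-to-octets text :external-format :utf-8))
       (latin-1 (sb-ext:string-to-octets text :external-format :latin-1))
       (back (sb-ext:octets-to-string utf-8 :external-format :utf-8)))
  ;; One character, two different byte sequences, depending on the encoding.
  (format t "UTF-8: ~S  Latin-1: ~S  round trip ok: ~S~%"
          utf-8 latin-1 (string= text back)))
```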

Until the interface with external formats is settled and documented (which essentially means figuring out how to handle newlines, byte-order-marks and similar), (apropos "EXTERNAL-FORMAT") is likely to be a good way to find out what is supported.

Other implementations

Please add links and indications about support of encodings in other implementations.

Working with character encodings in a semi-portable way

The Flexi-streams library can read and write character data in various single- or multi-octet encodings which can be changed on the fly.

Flexi-streams supports UTF-8, UTF-16, UTF-32, all ISO 8859 character sets, a couple of Windows code pages and US-ASCII.

External links

Wikipedia has more information about the various coding systems:

ASCII
ISO 646
ISO-8859
Unicode
OSF character set registry
IANA character set registry
ECMA Standards list, including references to a set of standard publications equivalent to the set of ISO/IEC standards for the ISO-8859-N character sets

Markus Kuhn's UTF-8 and Unicode FAQ is another very comprehensive resource.

Nolan Eakins also has an article describing how to use Unicode with SLIME, CLISP, and SBCL. The article's example is defining a function with a Unicode name.

Comments from Jack Unrue: I suggest changing the "if you want to copy a file" paragraph, such that the recommendation is for verbatim file copy operations to treat every file as if it were binary, to avoid accidental changes to end-line formats in addition to character encodings. And use the word "process" instead of "copy" in the second sentence of that paragraph.

Also, it might be worth mentioning that text files may have a byte order mark if they are encoded in one of the Unicode encodings.
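Checking for a byte order mark can be sketched portably by reading the first bytes of the file as (UNSIGNED-BYTE 8). The names below and the keywords used as labels are invented for this example. Note that the bytes FF FE 00 00 are ambiguous: this sketch reports them as UTF-32LE, although they could also start a UTF-16LE text whose first character is NUL.

```lisp
;; Longer marks come first so that UTF-32LE is tried before UTF-16LE.
(defparameter *byte-order-marks*
  '((:utf-32be #x00 #x00 #xFE #xFF)
    (:utf-32le #xFF #xFE #x00 #x00)
    (:utf-8    #xEF #xBB #xBF)
    (:utf-16be #xFE #xFF)
    (:utf-16le #xFF #xFE)))

(defun detect-byte-order-mark (bytes)
  "Return a keyword for the byte order mark that starts BYTES, or NIL."
  (loop for (encoding . mark) in *byte-order-marks*
        when (and (>= (length bytes) (length mark))
                  (every #'= mark bytes)) ; EVERY stops at the end of MARK
          return encoding))

(defun file-byte-order-mark (pathname)
  "Read at most four bytes from PATHNAME and look for a byte order mark."
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (let* ((buffer (make-array 4 :element-type '(unsigned-byte 8)))
           (end (read-sequence buffer in)))
      (detect-byte-order-mark (subseq buffer 0 end)))))
```

For example, (detect-byte-order-mark #(#xEF #xBB #xBF 65)) returns :UTF-8, and a sequence that starts with plain ASCII bytes returns NIL.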
