By default, Lisp files are text files. A text file is a file containing characters: things like aleph, three, e with acute accent, exclamation point, etc.
Lisp objects are typed, so there's no problem in Lisp with manipulating character objects. But in a Unix (POSIX) file we can only store numbers between 0 and 255. Unix (POSIX) has no notion of a text file, so we need to encode the characters into a byte stream.
There are several different coding systems for characters. The first step in defining a coding system is to specify the set of characters we will encode. The second step is to define a bijective mapping between this set of characters and a subset of the integers. A third step may be to define a way to encode a sequence of these integers (which may be bigger than 255) as a sequence of integers between 0 and 255. Optionally, an additional step may give meaning to codes that are not used so far, for purposes such as controlling output devices or structuring files (defining records, blocks, etc).
For example, ASCII defines the following set of characters:
SPC ! " # $ % & ' ( ) * + , - . /
0 1 2 3 4 5 6 7 8 9 : ; < = > ?
@ A B C D E F G H I J K L M N O
P Q R S T U V W X Y Z [ \ ] ^ _
` a b c d e f g h i j k l m n o
p q r s t u v w x y z { | } ~
ASCII defines the following mapping between characters and numbers:
32 SPC 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '
40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /
48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7
56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?
64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G
72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O
80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W
88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _
96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g
104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o
112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w
120 x 121 y 122 z 123 { 124 | 125 } 126 ~
Since all these numbers are between 0 and 255, ASCII uses the identity function to map the codes to a byte sequence.
Finally, since there are a number of free codes, ASCII defines some control codes:
0 NUL 1 SOH 2 STX 3 ETX 4 EOT 5 ENQ 6 ACK 7 BEL
8 BS 9 TAB 10 LF 11 VT 12 FF 13 CR 14 SO 15 SI
16 DLE 17 DC1 18 DC2 19 DC3 20 DC4 21 NAK 22 SYN 23 ETB
24 CAN 25 EM 26 SUB 27 ESC 28 FS 29 GS 30 RS 31 US
32 SPC 33 ! 34 " 35 # 36 $ 37 % 38 & 39 '
40 ( 41 ) 42 * 43 + 44 , 45 - 46 . 47 /
48 0 49 1 50 2 51 3 52 4 53 5 54 6 55 7
56 8 57 9 58 : 59 ; 60 < 61 = 62 > 63 ?
64 @ 65 A 66 B 67 C 68 D 69 E 70 F 71 G
72 H 73 I 74 J 75 K 76 L 77 M 78 N 79 O
80 P 81 Q 82 R 83 S 84 T 85 U 86 V 87 W
88 X 89 Y 90 Z 91 [ 92 \ 93 ] 94 ^ 95 _
96 ` 97 a 98 b 99 c 100 d 101 e 102 f 103 g
104 h 105 i 106 j 107 k 108 l 109 m 110 n 111 o
112 p 113 q 114 r 115 s 116 t 117 u 118 v 119 w
120 x 121 y 122 z 123 { 124 | 125 } 126 ~ 127 DEL
Note however that ASCII still doesn't define anything for bytes between 128 and 255.
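The ASCII steps described above (a character set, a mapping to integers, and an identity mapping to bytes) can be sketched in a few lines of Lisp. This is only an illustration: the function names ASCII-ENCODE and ASCII-DECODE are made up for this page, and the code assumes that CHAR-CODE and CODE-CHAR agree with ASCII for codes below 128, which the standard does not require but which holds on the usual implementations.

```lisp
;; Hypothetical helpers, not part of any implementation or library.
;; Assumption: CHAR-CODE / CODE-CHAR agree with ASCII below 128.

(defun ascii-encode (string)
  "Return a vector of bytes encoding STRING in ASCII, or signal an error."
  (map '(vector (unsigned-byte 8))
       (lambda (char)
         (let ((code (char-code char)))
           (if (< code 128)
               code                     ; identity mapping from code to byte
               (error "~S (code ~D) is not in the ASCII character set."
                      char code))))
       string))

(defun ascii-decode (bytes)
  "Return the string encoded by BYTES, or signal an error on bytes >= 128."
  (map 'string
       (lambda (byte)
         (if (< byte 128)
             (code-char byte)
             (error "Byte ~D has no meaning in ASCII." byte)))
       bytes))
```

With these definitions, (ascii-encode "Lisp!") returns #(76 105 115 112 33), matching the table above, and (ascii-decode #(200)) signals an error because ASCII defines nothing for that byte.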
Some coding systems, such as ISO-8859-1 or the various Unicode coding systems, are defined as extensions of ASCII (but a few old ones are totally different, and a few are modifications of ASCII).
The various ISO-8859 coding systems define additional characters in the character set and map them to codes between 160 and 255, and give meaning to additional control codes between 128 and 159.
Unicode defines a much bigger character set and maps its characters to integers between 0 and 1114111. Several ways to map these integers to sequences of bytes are then defined, such as UTF-7, UTF-8, UTF-16LE, UTF-16BE, etc. One interesting property of UTF-8 is that any sequence of characters in the ASCII character set produces the same byte sequence whether it is encoded using the rules of ASCII or the rules of UTF-8. But this may be misleading: you may try to read UTF-8 files using a default external format that uses the ASCII encoding, and it may work most of the time, but fail when you encounter a character outside the ASCII set.
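To make the ASCII compatibility of UTF-8 concrete, here is a minimal sketch of the UTF-8 rule for a single code point. It works on integers only, so it does not depend on how an implementation represents characters. The function name UTF-8-ENCODE-CODE-POINT is invented for this example, and the sketch does not reject the surrogate range #xD800 to #xDFFF, which a strict encoder would.

```lisp
;; Hypothetical helper: encode one Unicode code point (an integer) as UTF-8.
;; Simplification: surrogate code points (#xD800-#xDFFF) are not rejected.

(defun utf-8-encode-code-point (code)
  "Return the list of bytes encoding the code point CODE in UTF-8."
  (cond ((< code #x80)                  ; 1 byte: identical to ASCII
         (list code))
        ((< code #x800)                 ; 2 bytes: 110xxxxx 10xxxxxx
         (list (logior #xC0 (ldb (byte 5 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x10000)               ; 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
         (list (logior #xE0 (ldb (byte 4 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        ((< code #x110000)              ; 4 bytes: 11110xxx and three 10xxxxxx
         (list (logior #xF0 (ldb (byte 3 18) code))
               (logior #x80 (ldb (byte 6 12) code))
               (logior #x80 (ldb (byte 6 6) code))
               (logior #x80 (ldb (byte 6 0) code))))
        (t (error "~D is not a Unicode code point." code))))
```

For example, (utf-8-encode-code-point 65) returns (65), the same byte ASCII uses for A, while e with acute accent, (utf-8-encode-code-point #xE9), returns (195 169) and aleph, (utf-8-encode-code-point #x5D0), returns (215 144). Every byte produced for a non-ASCII character is 128 or above, which is exactly what an ASCII-only reader will choke on.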
Now given that the Common Lisp Standard Character Set consists of the 95 printable ASCII characters plus a newline (how lucky!), and that ASCII is used by Unix (POSIX), it is quite possible that the default EXTERNAL-FORMAT used to read and write text files in Lisp is ASCII-encoded lines separated by an LF control code.
If the file contains a byte between 128 and 255, there's no corresponding character in the ASCII map, so reading it will raise an error. If you try to write a character outside of the ASCII Character Set, then there will be no way to encode it and it will raise an error. If you try to read a byte between 0 and 31 (excluding 13 or 10, one or both of which may control the line termination), it may very well raise an error.
Some implementations may use another encoding and external-format by default or by configuration, and may map control codes to pseudo-characters to try to avoid raising errors, but this is implementation-specific behavior. Some implementations extend the ISO-8859-1 coding system to map the control codes to pseudo-characters so there's a 1-1 correspondence between [0,255] and these characters. Using this external-format on these implementations, you may be able to "safely" copy binary files. But this is only valid for some very specific implementations in very specific circumstances.
In conclusion, if you want to copy a file of bytes, you should use :ELEMENT-TYPE '(UNSIGNED-BYTE 8) (assuming it does what you want on your implementation/OS). If you want to copy a text file with a specific encoding, you must give the implementation specific :EXTERNAL-FORMAT. What you've programmed is only guaranteed to work on "well formed" "text" files, i.e. text files produced by the same lisp implementation on the same system.
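A byte-for-byte copy along these lines might look like the following sketch. It uses only standard operators (WITH-OPEN-FILE, READ-SEQUENCE, WRITE-SEQUENCE); the function name COPY-FILE-BYTES and the 4096-byte buffer size are arbitrary choices for illustration, and as noted above it assumes that (UNSIGNED-BYTE 8) streams do what you want on your implementation and OS.

```lisp
;; Hypothetical helper: copy a file without interpreting its contents.
;; No external format is involved, so no decoding error can occur.

(defun copy-file-bytes (from to)
  "Copy the file named FROM to the file named TO, byte for byte."
  (with-open-file (in from :element-type '(unsigned-byte 8))
    (with-open-file (out to :direction :output
                            :element-type '(unsigned-byte 8)
                            :if-exists :supersede)
      (let ((buffer (make-array 4096 :element-type '(unsigned-byte 8))))
        ;; READ-SEQUENCE returns the index of the first element of BUFFER
        ;; that was not updated; 0 means end of file.
        (loop for end = (read-sequence buffer in)
              while (plusp end)
              do (write-sequence buffer out :end end))))))
```

Because the streams carry bytes rather than characters, line terminators and bytes between 128 and 255 pass through unchanged.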
Unix systems don't use characters to name files. They use sequences of bytes not containing 0 or 47; the sequences (46) and (46 46) are reserved to mean the current directory and the parent directory.
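The rule for Unix file names can be written down as a small predicate over bytes. This is a sketch for illustration only: the name VALID-UNIX-FILE-NAME-P is invented here, and the check for an empty name is an addition not stated above.

```lisp
;; Hypothetical helper: may this byte sequence name a new directory entry?

(defun valid-unix-file-name-p (bytes)
  "True if BYTES, a sequence of integers in [0,255], is a usable file name."
  (let ((name (coerce bytes 'vector)))
    (and (plusp (length name))            ; assumption: empty names are unusable
         (not (find 0 name))              ; 0 terminates the name
         (not (find 47 name))             ; 47 separates path components
         (not (equalp name #(46)))        ; reserved: current directory
         (not (equalp name #(46 46))))))  ; reserved: parent directory
```

Note that nothing here refers to characters or to an encoding: (valid-unix-file-name-p #(195 169)) is true whether or not you read those two bytes as an e with acute accent in UTF-8.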
On 2007-01-29 19:45, on irc://irc.freenode.org/#scheme,
Please, let's [be pedantic], because this is too important an issue to
handle lightly and screw up.
On Unix, a pathname is an array of bytes, where a byte is usually an
octet. (It could be a septet, theoretically, but I don't think there
are any modern Unix systems where this is the case.) [Or a ninet.]
The byte #x2F has special meaning in the pathname. Note that this is
the interpretation of the actual bits involved; it is not the
character `/' that means anything to Unix, but the byte #x2F. Here,
by `the character ``/'' ', I mean the symbol that you are shown when
an ASCII text renderer displays the byte #x2F, and the graphical and
semantic implications thereof.
(In Unicode, `character' is *not* a well-defined term; its definition
is explicitly avoided because there are so many different possible
concepts competing for the name, so none was given it. There are
abstract characters, encoded characters, code points, scalar values,
glyphs, default grapheme clusters, and more; this is why I explain
what I mean by `the character ``/'' '.)
(Also, by `byte' I mean the smallest addressable unit of memory --
i.e. what C very misleadingly calls `char' --, and by `octet' I mean
an array of eight bits.)
Now, on Unix, the various system calls and library routines that work
with pathnames represent them as arrays of bytes, because that is what
a pathname is on Unix. They don't know anything about text codecs,
internationalization, user localization, and so on; they deal with
arrays of bytes.
When writing programs that work with pathnames and perhaps present
them to users, however, we usually want two sorts of higher-level
abstraction: one, a structure identifiable to the programming
language, partly for convenience, partly for clarity, and partly for
the sake of portability; and decoding and encoding pathnames for
users.
The user has some notion of `text', which must be stored physically on
the file system somehow; the file system knows about bytes, which must
be presented to the user in an understandable format somehow. To
accommodate this, Unix has a number of environment variables (which
are, incidentally, also named by byte arrays) by which the user can
identify preferences in the means of translation between machine and
human understanding.
For example, the user might often deal with Hebrew and not much else,
and may choose to use ISO-8859-7 to store pathnames. Or another user
might be uninterested in Western isolationism, and instead prefer
culturally insulting approximation of Eastern languages by choosing to
store pathnames in UTF-8, allowing the full range of Unicode text in
his pathnames.
The operating system cares nothing about the user's sociocultural
background, however, and deals only in bytes.
These two users may need to communicate at some point, and they would
like their programs to present pathnames without loss of information.
But on the other hand, they probably also want a consistent file
system without two distinct file names (that is, byte arrays) for what
they each see as the same thing -- a particular name composed of
Hebrew and Latin text, say. (Unfortunately, there *must* be some
information lost in the translation from Unicode to ISO-8859-7, but
never mind that for now.) (By the way, I apologize if I have offended
anyone by my choice of cultural derogation -- my intent was to poke
fun at everyone equally.)
When writing portable programs that deal with pathnames, furthermore,
we often want two different sorts of pathnames: those handed to us by
the operating system or otherwise tied to the byte array model, and
those that we wrote ourselves or read from user input with a
higher-level notion of what text the user intends.
Now, it is not entirely clear what choices are best for the mapping
between the two sorts of pathnames I described -- `text pathnames' and
`byte pathnames', say.
In Scheme48 1.5, the decision is made for the programmer: whenever he
constructs a pathname (or `os-string') from a byte vector, its textual
interpretation is fixed by the locale when the Scheme48 image was
started; whenever he constructs a pathname (or `os-string') from a
textual description, its encoding in bytes is fixed by the locale at
the time the Scheme48 image was started.
Implementation Specific Support for Encodings
ACL
ACL has extensive support for different text coding systems. For details see International Character Support in Allegro CL.
CLISP
CLISP is usually compiled with libiconv and Unicode
support. It supports all the coding systems provided by libiconv,
exported as variables from the CHARSET package. The EXTERNAL-FORMAT
arguments can take an EXT:ENCODING object, which is composed of the
character set (as exported from CHARSET) and a line terminator mode,
which indicates which control codes are used to encode a #\Newline
pseudo-character.
There are command-line arguments and variables in the CUSTOM package
to set various default encodings (pathnames, file contents, terminal,
foreign function calls, miscellaneous).
Two functions are provided to encode and decode between strings of
characters and vectors of bytes: EXT:CONVERT-STRING-FROM-BYTES and
EXT:CONVERT-STRING-TO-BYTES.
See the CLISP implementation notes, 30.5. Encodings.
CMUCL
CMUCL apparently supports UTF-8 I/O through its simple-streams implementation, as well as ISO-8859-1; its regular file streams support only ISO-8859-1.
OpenMCL
OpenMCL has no support for different text coding systems.
SBCL
SBCL supports a limited number of coding systems. Because the interface has not been finalized yet, they are undocumented, but as of the 0.9.15 release the following are recognized:
(:ASCII :CP1250 :CP1251 :CP1252 :CP1253 :CP1254 :CP1255 :CP1256 :CP1257 :CP1258
:CP437 :CP850 :CP852 :CP855 :CP857 :CP860 :CP861 :CP862 :CP863 :CP864 :CP865
:CP866 :CP869 :CP874 :EBCDIC-US :EUC-JP :ISO-8859-1 :ISO-8859-10 :ISO-8859-11
:ISO-8859-13 :ISO-8859-14 :ISO-8859-15 :ISO-8859-2 :ISO-8859-3 :ISO-8859-4
:ISO-8859-5 :ISO-8859-6 :ISO-8859-7 :ISO-8859-8 :ISO-8859-9 :KOI8-R :KOI8-U
:UCS-2BE :UCS-2LE :UTF-8 :X-MAC-CYRILLIC)
along with various aliases (so for instance :LATIN-9 is an alias for :ISO-8859-15). At present, there is no support for automatic translation of line-ending or byte-order-mark conventions.
Additionally, for in-memory conversion, SBCL provides the following functions:
SB-EXT:STRING-TO-OCTETS string &key external-format --> octets
SB-EXT:OCTETS-TO-STRING octets &key external-format --> string
string -- a string.
octets -- a vector of (unsigned-byte 8).
external-format -- a keyword designating an external format (see above).
Until the interface with external formats is settled and documented (which essentially means figuring out how to handle newlines, byte-order-marks and similar), (apropos "EXTERNAL-FORMAT") is likely to be a good way to find out what is supported.
Other implementations
Please add links and indications about support of encodings in other
implementations.
Working with character encodings in a semi-portable way
The Flexi-streams library can “read and write character data in various single- or multi-octet encodings which can be changed on the fly.”
Flexi-streams supports UTF-8, UTF-16, UTF-32, all ISO 8859 character sets, a couple of Windows code pages and US-ASCII.
External links
Wikipedia has more information about the various coding systems.
Markus Kuhn's UTF-8 and Unicode FAQ is another very comprehensive resource.
Nolan Eakins also has an article describing how to use Unicode with Slime, CLISP, and SBCL. The article's example is defining a function with a Unicode name.
Also, it might be worth mentioning that text files may have a
byte order mark if they are encoded in one of the Unicode encodings.
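A byte order mark at the start of a Unicode-encoded text file can be recognized from its first few bytes, before any decoding takes place. The following is a rough sketch under stated assumptions: the function name DETECT-BYTE-ORDER-MARK and the keywords it returns are made up for this example (they are not the external-format names of any particular implementation), and UTF-32 is ignored, so a UTF-32LE mark would be reported as :UTF-16LE.

```lisp
;; Hypothetical helper: guess an encoding from a leading byte order mark.
;; Limitation: UTF-32 marks are not distinguished from UTF-16 ones.

(defun detect-byte-order-mark (bytes)
  "Return a keyword for the BOM at the start of BYTES, or NIL if none."
  (flet ((starts-with (&rest prefix)
           ;; MISMATCH returns NIL when the sequences are equal, or the
           ;; index in PREFIX where they first differ.
           (let ((index (mismatch prefix bytes)))
             (or (null index) (= index (length prefix))))))
    (cond ((starts-with #xEF #xBB #xBF) :utf-8)
          ((starts-with #xFE #xFF)      :utf-16be)
          ((starts-with #xFF #xFE)      :utf-16le)
          (t nil))))
```

The bytes would typically come from the start of a file opened with :ELEMENT-TYPE '(UNSIGNED-BYTE 8). A file without a mark returns NIL, which says nothing about its encoding: UTF-8 files in particular are often written without one.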
Comments from Jack Unrue:
I suggest changing the "if you want to copy a file" paragraph, such
that the recommendation is for verbatim file copy operations to treat
every file as if it were binary, to avoid accidental changes to
end-line formats in addition to character encodings. And use the word
"process" instead of "copy" in the second sentence of that paragraph.
Categories: Online Tutorial CLISP SBCL