comparison src/coding.c @ 88771:75c78754826d
| author | Dave Love <fx@gnu.org> |
|---|---|
| date | Sun, 16 Jun 2002 19:57:54 +0000 |
| parents | 7f284ac55b07 |
| children | 64b8f6168269 |
| 88770:7df1e731d256 | 88771:75c78754826d |
|---|---|
| 92 section 8. | 92 section 8. |
| 93 | 93 |
| 94 o BIG5 | 94 o BIG5 |
| 95 | 95 |
| 96 A coding system to encode character sets: ASCII and Big5. Widely | 96 A coding system to encode character sets: ASCII and Big5. Widely |
| 97 used by Chinese (mainly in Taiwan and Hong Kong). Details are | 97 used for Chinese (mainly in Taiwan and Hong Kong). Details are |
| 98 described in section 8. In this file, when we write "big5" (all | 98 described in section 8. In this file, when we write "big5" (all |
| 99 lowercase), we mean the coding system, and when we write "Big5" | 99 lowercase), we mean the coding system, and when we write "Big5" |
| 100 (capitalized), we mean the character set. | 100 (capitalized), we mean the character set. |
| 101 | 101 |
| 102 o CCL | 102 o CCL |
| 106 CCL (Code Conversion Language) programs. Emacs executes the CCL | 106 CCL (Code Conversion Language) programs. Emacs executes the CCL |
| 107 program while decoding/encoding. | 107 program while decoding/encoding. |
| 108 | 108 |
| 109 o Raw-text | 109 o Raw-text |
| 110 | 110 |
| 111 A coding system for a text containing raw eight-bit data. Emacs | 111 A coding system for text containing raw eight-bit data. Emacs |
| 112 treats each byte of source text as a character (except for | 112 treats each byte of source text as a character (except for |
| 113 end-of-line conversion). | 113 end-of-line conversion). |
| 114 | 114 |
| 115 o No-conversion | 115 o No-conversion |
| 116 | 116 |
| 585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder) | 585 AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_encoder) |
| 586 #define CODING_CCL_VALIDS(coding) \ | 586 #define CODING_CCL_VALIDS(coding) \ |
| 587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ | 587 (XSTRING (AREF (CODING_ID_ATTRS ((coding)->id), coding_attr_ccl_valids)) \ |
| 588 ->data) | 588 ->data) |
| 589 | 589 |
| 590 /* Index for each coding category in `coding_category_table' */ | 590 /* Index for each coding category in `coding_categories' */ |
| 591 | 591 |
| 592 enum coding_category | 592 enum coding_category |
| 593 { | 593 { |
| 594 coding_category_iso_7, | 594 coding_category_iso_7, |
| 595 coding_category_iso_7_tight, | 595 coding_category_iso_7_tight, |
| 2047 | 2047 |
| 2048 /*** 7. ISO2022 handlers ***/ | 2048 /*** 7. ISO2022 handlers ***/ |
| 2049 | 2049 |
| 2050 /* The following note describes the coding system ISO2022 briefly. | 2050 /* The following note describes the coding system ISO2022 briefly. |
| 2051 Since the intention of this note is to help understand the | 2051 Since the intention of this note is to help understand the |
| 2052 functions in this file, some parts are NOT ACCURATE or OVERLY | 2052 functions in this file, some parts are NOT ACCURATE or are OVERLY |
| 2053 SIMPLIFIED. For thorough understanding, please refer to the | 2053 SIMPLIFIED. For thorough understanding, please refer to the |
| 2054 original document of ISO2022. | 2054 original document of ISO2022. This is equivalent to the standard |
| | 2055 ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*). |
| 2055 | 2056 |
| 2056 ISO2022 provides many mechanisms to encode several character sets | 2057 ISO2022 provides many mechanisms to encode several character sets |
| 2057 in 7-bit and 8-bit environments. For 7-bite environments, all text | 2058 in 7-bit and 8-bit environments. For 7-bit environments, all text |
| 2058 is encoded using bytes less than 128. This may make the encoded | 2059 is encoded using bytes less than 128. This may make the encoded |
| 2059 text a little bit longer, but the text passes more easily through | 2060 text a little bit longer, but the text passes more easily through |
| 2060 several gateways, some of which strip off MSB (Most Signigant Bit). | 2061 several types of gateway, some of which strip off the MSB (Most |
| 2061 | 2062 Significant Bit). |
| 2062 There are two kinds of character sets: control character set and | 2063 |
| 2063 graphic character set. The former contains control characters such | 2064 There are two kinds of character sets: control character sets and |
| | 2065 graphic character sets. The former contain control characters such |
| 2064 as `newline' and `escape' to provide control functions (control | 2066 as `newline' and `escape' to provide control functions (control |
| 2065 functions are also provided by escape sequences). The latter | 2067 functions are also provided by escape sequences). The latter |
| 2066 contains graphic characters such as 'A' and '-'. Emacs recognizes | 2068 contain graphic characters such as 'A' and '-'. Emacs recognizes |
| 2067 two control character sets and many graphic character sets. | 2069 two control character sets and many graphic character sets. |
| 2068 | 2070 |
| 2069 Graphic character sets are classified into one of the following | 2071 Graphic character sets are classified into one of the following |
| 2070 four classes, according to the number of bytes (DIMENSION) and | 2072 four classes, according to the number of bytes (DIMENSION) and |
| 2071 number of characters in one dimension (CHARS) of the set: | 2073 number of characters in one dimension (CHARS) of the set: |
| 2072 - DIMENSION1_CHARS94 | 2074 - DIMENSION1_CHARS94 |
| 2073 - DIMENSION1_CHARS96 | 2075 - DIMENSION1_CHARS96 |
| 2074 - DIMENSION2_CHARS94 | 2076 - DIMENSION2_CHARS94 |
| 2075 - DIMENSION2_CHARS96 | 2077 - DIMENSION2_CHARS96 |
| 2076 | 2078 |
| 2077 In addition, each character set is assigned an identification tag, | 2079 In addition, each character set is assigned an identification tag, |
| 2078 unique for each set, called "final character" (denoted as <F> | 2080 unique for each set, called the "final character" (denoted as <F> |
| 2079 hereafter). The <F> of each character set is decided by ECMA(*) | 2081 hereafter). The <F> of each character set is decided by ECMA(*) |
| 2080 when it is registered in ISO. The code range of <F> is 0x30..0x7F | 2082 when it is registered in ISO. The code range of <F> is 0x30..0x7F |
| 2081 (0x30..0x3F are for private use only). | 2083 (0x30..0x3F are for private use only). |
| 2082 | 2084 |
| 2083 Note (*): ECMA = European Computer Manufacturers Association | 2085 Note (*): ECMA = European Computer Manufacturers Association |
| 2084 | 2086 |
| 2085 Here are examples of graphic character set [NAME(<F>)]: | 2087 Here are examples of graphic character sets [NAME(<F>)]: |
| 2086 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | 2088 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... |
| 2087 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | 2089 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... |
| 2088 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 2090 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
| 2089 o DIMENSION2_CHARS96 -- none for the moment | 2091 o DIMENSION2_CHARS96 -- none for the moment |
| 2090 | 2092 |
| 2173 7-bit environment, non-locking-shift, and non-single-shift. | 2175 7-bit environment, non-locking-shift, and non-single-shift. |
| 2174 | 2176 |
| 2175 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 2177 Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 2176 '(' must be omitted. We refer to this as "short-form" hereafter. | 2178 '(' must be omitted. We refer to this as "short-form" hereafter. |
| 2177 | 2179 |
| 2178 Now you may notice that there are a lot of ways for encoding the | 2180 Now you may notice that there are a lot of ways of encoding the |
| 2179 same multilingual text in ISO2022. Actually, there exist many | 2181 same multilingual text in ISO2022. Actually, there exist many |
| 2180 coding systems such as Compound Text (used in X11's inter client | 2182 coding systems such as Compound Text (used in X11's inter client |
| 2181 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR | 2183 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR |
| 2182 (used in Korean internet), EUC (Extended UNIX Code, used in Asian | 2184 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian |
| 2183 localized platforms), and all of these are variants of ISO2022. | 2185 localized platforms), and all of these are variants of ISO2022. |
| 2184 | 2186 |
| 2185 In addition to the above, Emacs handles two more kinds of escape | 2187 In addition to the above, Emacs handles two more kinds of escape |
| 2186 sequences: ISO6429's direction specification and Emacs' private | 2188 sequences: ISO6429's direction specification and Emacs' private |
| 2187 sequence for specifying character composition. | 2189 sequence for specifying character composition. |
| 2199 o ESC '1' -- end composition | 2201 o ESC '1' -- end composition |
| 2200 o ESC '2' -- start rule-base composition (*) | 2202 o ESC '2' -- start rule-base composition (*) |
| 2201 o ESC '3' -- start relative composition with alternate chars (**) | 2203 o ESC '3' -- start relative composition with alternate chars (**) |
| 2202 o ESC '4' -- start rule-base composition with alternate chars (**) | 2204 o ESC '4' -- start rule-base composition with alternate chars (**) |
| 2203 Since these are not standard escape sequences of any ISO standard, | 2205 Since these are not standard escape sequences of any ISO standard, |
| 2204 the use of them for these meaning is restricted to Emacs only. | 2206 the use of them with these meanings is restricted to Emacs only. |
| 2205 | 2207 |
| 2206 (*) This form is used only in Emacs 20.5 and the older versions, | 2208 (*) This form is used only in Emacs 20.7 and older versions, |
| 2207 but the newer versions can safely decode it. | 2209 but newer versions can safely decode it. |
| 2208 (**) This form is used only in Emacs 21.1 and the newer versions, | 2210 (**) This form is used only in Emacs 21.1 and newer versions, |
| 2209 and the older versions can't decode it. | 2211 and older versions can't decode it. |
| 2210 | 2212 |
| 2211 Here's a list of examples usages of these composition escape | 2213 Here's a list of example usages of these composition escape |
| 2212 sequences (categorized by `enum composition_method'). | 2214 sequences (categorized by `enum composition_method'). |
| 2213 | 2215 |
| 2214 COMPOSITION_RELATIVE: | 2216 COMPOSITION_RELATIVE: |
| 2215 ESC 0 CHAR [ CHAR ] ESC 1 | 2217 ESC 0 CHAR [ CHAR ] ESC 1 |
| 2216 COMPOSITOIN_WITH_RULE: | 2218 COMPOSITION_WITH_RULE: |
| 2217 ESC 2 CHAR [ RULE CHAR ] ESC 1 | 2219 ESC 2 CHAR [ RULE CHAR ] ESC 1 |
| 2218 COMPOSITION_WITH_ALTCHARS: | 2220 COMPOSITION_WITH_ALTCHARS: |
| 2219 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 | 2221 ESC 3 ALTCHAR [ ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 |
| 2220 COMPOSITION_WITH_RULE_ALTCHARS: | 2222 COMPOSITION_WITH_RULE_ALTCHARS: |
| 2221 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */ | 2223 ESC 4 ALTCHAR [ RULE ALTCHAR ] ESC 0 CHAR [ CHAR ] ESC 1 */ |
| 4533 } | 4535 } |
| 4534 | 4536 |
| 4535 | 4537 |
| 4536 /*** 7. C library functions ***/ | 4538 /*** 7. C library functions ***/ |
| 4537 | 4539 |
| 4538 /* In Emacs Lisp, coding system is represented by a Lisp symbol which | |
| 4539 has a property `coding-system'. The value of this property is a | |
| 4540 vector of length 5 (called as coding-vector). Among elements of | |
| 4541 this vector, the first (element[0]) and the fifth (element[4]) | |
| 4542 carry important information for decoding/encoding. Before | |
| 4543 decoding/encoding, this information should be set in fields of a | |
| 4544 structure of type `coding_system'. | |
| 4545 | |
| 4546 A value of property `coding-system' can be a symbol of another | |
| 4547 subsidiary coding-system. In that case, Emacs gets coding-vector | |
| 4548 from that symbol. | |
| 4549 | |
| 4550 `element[0]' contains information to be set in `coding->type'. The | |
| 4551 value and its meaning is as follows: | |
| 4552 | |
| 4553 0 -- coding_type_emacs_mule | |
| 4554 1 -- coding_type_sjis | |
| 4555 2 -- coding_type_iso_2022 | |
| 4556 3 -- coding_type_big5 | |
| 4557 4 -- coding_type_ccl encoder/decoder written in CCL | |
| 4558 nil -- coding_type_no_conversion | |
| 4559 t -- coding_type_undecided (automatic conversion on decoding, | |
| 4560 no-conversion on encoding) | |
| 4561 | |
| 4562 `element[4]' contains information to be set in `coding->flags' and | |
| 4563 `coding->spec'. The meaning varies by `coding->type'. | |
| 4564 | |
| 4565 If `coding->type' is `coding_type_iso_2022', element[4] is a vector | |
| 4566 of length 32 (of which the first 13 sub-elements are used now). | |
| 4567 Meanings of these sub-elements are: | |
| 4568 | |
| 4569 sub-element[N] where N is 0 through 3: to be set in `coding->spec.iso_2022' | |
| 4570 If the value is an integer of valid charset, the charset is | |
| 4571 assumed to be designated to graphic register N initially. | |
| 4572 | |
| 4573 If the value is minus, it is a minus value of charset which | |
| 4574 reserves graphic register N, which means that the charset is | |
| 4575 not designated initially but should be designated to graphic | |
| 4576 register N just before encoding a character in that charset. | |
| 4577 | |
| 4578 If the value is nil, graphic register N is never used on | |
| 4579 encoding. | |
| 4580 | |
| 4581 sub-element[N] where N is 4 through 11: to be set in `coding->flags' | |
| 4582 Each value takes t or nil. See the section ISO2022 of | |
| 4583 `coding.h' for more information. | |
| 4584 | |
| 4585 If `coding->type' is `coding_type_big5', element[4] is t to denote | |
| 4586 BIG5-ETen or nil to denote BIG5-HKU. | |
| 4587 | |
| 4588 If `coding->type' takes the other value, element[4] is ignored. | |
| 4589 | |
| 4590 Emacs Lisp's coding system also carries information about format of | |
| 4591 end-of-line in a value of property `eol-type'. If the value is | |
| 4592 integer, 0 means eol_lf, 1 means eol_crlf, and 2 means eol_cr. If | |
| 4593 it is not integer, it should be a vector of subsidiary coding | |
| 4594 systems of which property `eol-type' has one of above values. | |
| 4595 | |
| 4596 */ | |
| 4597 | |
| 4598 /* Setup coding context CODING from information about CODING_SYSTEM. | 4540 /* Setup coding context CODING from information about CODING_SYSTEM. |
| 4599 If CODING_SYSTEM is nil, `no-conversion' is assumed. If | 4541 If CODING_SYSTEM is nil, `no-conversion' is assumed. If |
| 4600 CODING_SYSTEM is invalid, signal an error. */ | 4542 CODING_SYSTEM is invalid, signal an error. */ |
| 4601 | 4543 |
| 4602 void | 4544 void |
