comparison src/coding.c @ 18766:ac2e7e21abb0
Comment changes.
| author | Richard M. Stallman <rms@gnu.org> |
|---|---|
| date | Sun, 13 Jul 1997 20:43:31 +0000 |
| parents | 17039a6e64cf |
| children | 954e6be0a757 |
comparison
| 18765:a407fb58d35f | 18766:ac2e7e21abb0 |
|---|---|
| 42 coding system. | 42 coding system. |
| 43 | 43 |
| 44 0. Emacs' internal format (emacs-mule) | 44 0. Emacs' internal format (emacs-mule) |
| 45 | 45 |
| 46 Emacs itself holds a multi-lingual character in a buffer and a string | 46 Emacs itself holds a multi-lingual character in a buffer and a string |
| 47 in a special format. Details are described in the section 2. | 47 in a special format. Details are described in section 2. |
| 48 | 48 |
| 49 1. ISO2022 | 49 1. ISO2022 |
| 50 | 50 |
| 51 The most famous coding system for multiple character sets. X's | 51 The most famous coding system for multiple character sets. X's |
| 52 Compound Text, various EUCs (Extended Unix Code), and such coding | 52 Compound Text, various EUCs (Extended Unix Code), and coding |
| 53 systems used in Internet communication as ISO-2022-JP are all | 53 systems used in Internet communication such as ISO-2022-JP are |
| 54 variants of ISO2022. Details are described in the section 3. | 54 all variants of ISO2022. Details are described in section 3. |
| 55 | 55 |
| 56 2. SJIS (or Shift-JIS or MS-Kanji-Code) | 56 2. SJIS (or Shift-JIS or MS-Kanji-Code) |
| 57 | 57 |
| 58 A coding system to encode character sets: ASCII, JISX0201, and | 58 A coding system to encode character sets: ASCII, JISX0201, and |
| 59 JISX0208. Widely used for PC's in Japan. Details are described in | 59 JISX0208. Widely used for PC's in Japan. Details are described in |
| 60 the section 4. | 60 section 4. |
| 61 | 61 |
| 62 3. BIG5 | 62 3. BIG5 |
| 63 | 63 |
| 64 A coding system to encode character sets: ASCII and Big5. Widely | 64 A coding system to encode character sets: ASCII and Big5. Widely |
| 65 used by Chinese (mainly in Taiwan and Hong Kong). Details are | 65 used by Chinese (mainly in Taiwan and Hong Kong). Details are |
| 66 described in the section 4. In this file, when written as "BIG5" | 66 described in section 4. In this file, when we write "BIG5" |
| 67 (all uppercase), it means the coding system, and when written as | 67 (all uppercase), we mean the coding system, and when we write |
| 68 "Big5" (capitalized), it means the character set. | 68 "Big5" (capitalized), we mean the character set. |
| 69 | 69 |
| 70 4. Else | 70 4. Other |
| 71 | 71 |
| 72 If a user want to read/write a text encoded in a coding system not | 72 If a user wants to read/write a text encoded in a coding system not |
| 73 listed above, he can supply a decoder and an encoder for it in CCL | 73 listed above, he can supply a decoder and an encoder for it in CCL |
| 74 (Code Conversion Language) programs. Emacs executes the CCL program | 74 (Code Conversion Language) programs. Emacs executes the CCL program |
| 75 while reading/writing. | 75 while reading/writing. |
| 76 | 76 |
| 77 Emacs represent a coding-system by a Lisp symbol that has a property | 77 Emacs represents a coding-system by a Lisp symbol that has a property |
| 78 `coding-system'. But, before actually using the coding-system, the | 78 `coding-system'. But, before actually using the coding-system, the |
| 79 information about it is set in a structure of type `struct | 79 information about it is set in a structure of type `struct |
| 80 coding_system' for rapid processing. See the section 6 for more | 80 coding_system' for rapid processing. See section 6 for more details. |
| 81 detail. | |
| 82 | 81 |
| 83 */ | 82 */ |
| 84 | 83 |
| 85 /*** GENERAL NOTES on END-OF-LINE FORMAT *** | 84 /*** GENERAL NOTES on END-OF-LINE FORMAT *** |
| 86 | 85 |
| 87 How end-of-line of a text is encoded depends on a system. For | 86 How end-of-line of a text is encoded depends on a system. For |
| 88 instance, Unix's format is just one byte of `line-feed' code, | 87 instance, Unix's format is just one byte of `line-feed' code, |
| 89 whereas DOS's format is two bytes sequence of `carriage-return' and | 88 whereas DOS's format is two-byte sequence of `carriage-return' and |
| 90 `line-feed' codes. MacOS's format is one byte of `carriage-return'. | 89 `line-feed' codes. MacOS's format is one byte of `carriage-return'. |
| 91 | 90 |
| 92 Since how characters in a text is encoded and how end-of-line is | 91 Since text characters encoding and end-of-line encoding are |
| 93 encoded is independent, any coding system described above can take | 92 independent, any coding system described above can take |
| 94 any format of end-of-line. So, Emacs has information of format of | 93 any format of end-of-line. So, Emacs has information of format of |
| 95 end-of-line in each coding-system. See the section 6 for more | 94 end-of-line in each coding-system. See section 6 for more details. |
| 96 detail. | |
| 97 | 95 |
| 98 */ | 96 */ |
| 99 | 97 |
| 100 /*** GENERAL NOTES on `detect_coding_XXX ()' functions *** | 98 /*** GENERAL NOTES on `detect_coding_XXX ()' functions *** |
| 101 | 99 |
| 115 | 113 |
| 116 /*** GENERAL NOTES on `decode_coding_XXX ()' functions *** | 114 /*** GENERAL NOTES on `decode_coding_XXX ()' functions *** |
| 117 | 115 |
| 118 These functions decode SRC_BYTES length text at SOURCE encoded in | 116 These functions decode SRC_BYTES length text at SOURCE encoded in |
| 119 CODING to Emacs' internal format (emacs-mule). The resulting text | 117 CODING to Emacs' internal format (emacs-mule). The resulting text |
| 120 goes to a place pointed by DESTINATION, the length of which should | 118 goes to a place pointed to by DESTINATION, the length of which should |
| 121 not exceed DST_BYTES. The bytes actually processed is returned as | 119 not exceed DST_BYTES. The number of bytes actually processed is |
| 122 *CONSUMED. The return value is the length of the decoded text. | 120 returned as *CONSUMED. The return value is the length of the decoded |
| 123 Below is a template of these functions. */ | 121 text. Below is a template of these functions. */ |
| 124 #if 0 | 122 #if 0 |
| 125 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) | 123 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) |
| 126 struct coding_system *coding; | 124 struct coding_system *coding; |
| 127 unsigned char *source, *destination; | 125 unsigned char *source, *destination; |
| 128 int src_bytes, dst_bytes; | 126 int src_bytes, dst_bytes; |
| 134 | 132 |
| 135 /*** GENERAL NOTES on `encode_coding_XXX ()' functions *** | 133 /*** GENERAL NOTES on `encode_coding_XXX ()' functions *** |
| 136 | 134 |
| 137 These functions encode SRC_BYTES length text at SOURCE of Emacs' | 135 These functions encode SRC_BYTES length text at SOURCE of Emacs' |
| 138 internal format (emacs-mule) to CODING. The resulting text goes to | 136 internal format (emacs-mule) to CODING. The resulting text goes to |
| 139 a place pointed by DESTINATION, the length of which should not | 137 a place pointed to by DESTINATION, the length of which should not |
| 140 exceed DST_BYTES. The bytes actually processed is returned as | 138 exceed DST_BYTES. The number of bytes actually processed is |
| 141 *CONSUMED. The return value is the length of the encoded text. | 139 returned as *CONSUMED. The return value is the length of the |
| 142 Below is a template of these functions. */ | 140 encoded text. Below is a template of these functions. */ |
| 143 #if 0 | 141 #if 0 |
| 144 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) | 142 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes, consumed) |
| 145 struct coding_system *coding; | 143 struct coding_system *coding; |
| 146 unsigned char *source, *destination; | 144 unsigned char *source, *destination; |
| 147 int src_bytes, dst_bytes; | 145 int src_bytes, dst_bytes; |
| 198 *dst++ = 0xA0, *dst++ = (c) | 0x80; \ | 196 *dst++ = 0xA0, *dst++ = (c) | 0x80; \ |
| 199 else \ | 197 else \ |
| 200 *dst++ = (c); \ | 198 *dst++ = (c); \ |
| 201 } while (0) | 199 } while (0) |
| 202 | 200 |
| 203 /* Decode one DIMENSION1 character of which charset is CHARSET and | 201 /* Decode one DIMENSION1 character whose charset is CHARSET and whose |
| 204 position-code is C. */ | 202 position-code is C. */ |
| 205 | 203 |
| 206 #define DECODE_CHARACTER_DIMENSION1(charset, c) \ | 204 #define DECODE_CHARACTER_DIMENSION1(charset, c) \ |
| 207 do { \ | 205 do { \ |
| 208 unsigned char leading_code = CHARSET_LEADING_CODE_BASE (charset); \ | 206 unsigned char leading_code = CHARSET_LEADING_CODE_BASE (charset); \ |
| 213 if (leading_code = CHARSET_LEADING_CODE_EXT (charset)) \ | 211 if (leading_code = CHARSET_LEADING_CODE_EXT (charset)) \ |
| 214 *dst++ = leading_code; \ | 212 *dst++ = leading_code; \ |
| 215 *dst++ = (c) | 0x80; \ | 213 *dst++ = (c) | 0x80; \ |
| 216 } while (0) | 214 } while (0) |
| 217 | 215 |
| 218 /* Decode one DIMENSION2 character of which charset is CHARSET and | 216 /* Decode one DIMENSION2 character whose charset is CHARSET and whose |
| 219 position-codes are C1 and C2. */ | 217 position-codes are C1 and C2. */ |
| 220 | 218 |
| 221 #define DECODE_CHARACTER_DIMENSION2(charset, c1, c2) \ | 219 #define DECODE_CHARACTER_DIMENSION2(charset, c1, c2) \ |
| 222 do { \ | 220 do { \ |
| 223 DECODE_CHARACTER_DIMENSION1 (charset, c1); \ | 221 DECODE_CHARACTER_DIMENSION1 (charset, c1); \ |
| 335 | 333 |
| 336 | 334 |
| 337 /*** 2. Emacs internal format (emacs-mule) handlers ***/ | 335 /*** 2. Emacs internal format (emacs-mule) handlers ***/ |
| 338 | 336 |
| 339 /* Emacs' internal format for encoding multiple character sets is a | 337 /* Emacs' internal format for encoding multiple character sets is a |
| 340 kind of multi-byte encoding, i.e. encoding a character by a sequence | 338 kind of multi-byte encoding, i.e. characters are encoded by |
| 341 of one-byte codes of variable length. ASCII characters and control | 339 variable-length sequences of one-byte codes. ASCII characters |
| 342 characters (e.g. `tab', `newline') are represented by one-byte as | 340 and control characters (e.g. `tab', `newline') are represented by |
| 343 is. It takes the range 0x00 through 0x7F. The other characters | 341 one-byte sequences which are their ASCII codes, in the range 0x00 |
| 344 are represented by a sequence of `base leading-code', optional | 342 through 0x7F. The other characters are represented by a sequence |
| 345 `extended leading-code', and one or two `position-code's. Length | 343 of `base leading-code', optional `extended leading-code', and one |
| 346 of the sequence is decided by the base leading-code. Leading-code | 344 or two `position-code's. The length of the sequence is determined |
| 347 takes the range 0x80 through 0x9F, whereas extended leading-code | 345 by the base leading-code. Leading-code takes the range 0x80 |
| 348 and position-code take the range 0xA0 through 0xFF. See the | 346 through 0x9F, whereas extended leading-code and position-code take |
| 349 document of `charset.h' for more detail about leading-code and | 347 the range 0xA0 through 0xFF. See `charset.h' for more details |
| 350 position-code. | 348 about leading-code and position-code. |
| 351 | 349 |
| 352 There's one exception in this rule. Special leading-code | 350 There's one exception to this rule. Special leading-code |
| 353 `leading-code-composition' denotes that the following several | 351 `leading-code-composition' denotes that the following several |
| 354 characters should be composed into one character. Leading-codes of | 352 characters should be composed into one character. Leading-codes of |
| 355 components (except for ASCII) are added 0x20. An ASCII character | 353 components (except for ASCII) are added 0x20. An ASCII character |
| 356 component is represented by a 2-byte sequence of `0xA0' and | 354 component is represented by a 2-byte sequence of `0xA0' and |
| 357 `ASCII-code + 0x80'. See also the document in `charset.h' for the | 355 `ASCII-code + 0x80'. See also the comments in `charset.h' for the |
| 358 detail of composite character. Hence, we can summarize the code | 356 details of composite character. Hence, we can summarize the code |
| 359 range as follows: | 357 range as follows: |
| 360 | 358 |
| 361 --- CODE RANGE of Emacs' internal format --- | 359 --- CODE RANGE of Emacs' internal format --- |
| 362 (character set) (range) | 360 (character set) (range) |
| 363 ASCII 0x00 .. 0x7F | 361 ASCII 0x00 .. 0x7F |
| 445 | 443 |
| 446 | 444 |
| 447 /*** 3. ISO2022 handlers ***/ | 445 /*** 3. ISO2022 handlers ***/ |
| 448 | 446 |
| 449 /* The following note describes the coding system ISO2022 briefly. | 447 /* The following note describes the coding system ISO2022 briefly. |
| 450 Since the intension of this note is to help understanding of the | 448 Since the intention of this note is to help in understanding of |
| 451 programs in this file, some parts are NOT ACCURATE or OVERLY | 449 the programs in this file, some parts are NOT ACCURATE or OVERLY |
| 452 SIMPLIFIED. For the thorough understanding, please refer to the | 450 SIMPLIFIED. For the thorough understanding, please refer to the |
| 453 original document of ISO2022. | 451 original document of ISO2022. |
| 454 | 452 |
| 455 ISO2022 provides many mechanisms to encode several character sets | 453 ISO2022 provides many mechanisms to encode several character sets |
| 456 in 7-bit and 8-bit environment. If one choose 7-bite environment, | 454 in 7-bit and 8-bit environment. If one chooses 7-bite environment, |
| 457 all text is encoded by codes of less than 128. This may make the | 455 all text is encoded by codes of less than 128. This may make the |
| 458 encoded text a little bit longer, but the text get more stability | 456 encoded text a little bit longer, but the text gets more stability |
| 459 to pass through several gateways (some of them split MSB off). | 457 to pass through several gateways (some of them strip off the MSB). |
| 460 | 458 |
| 461 There are two kind of character set: control character set and | 459 There are two kinds of character set: control character set and |
| 462 graphic character set. The former contains control characters such | 460 graphic character set. The former contains control characters such |
| 463 as `newline' and `escape' to provide control functions (control | 461 as `newline' and `escape' to provide control functions (control |
| 464 functions are provided also by escape sequence). The latter | 462 functions are provided also by escape sequences). The latter |
| 465 contains graphic characters such as ' A' and '-'. Emacs recognizes | 463 contains graphic characters such as ' A' and '-'. Emacs recognizes |
| 466 two control character sets and many graphic character sets. | 464 two control character sets and many graphic character sets. |
| 467 | 465 |
| 468 Graphic character sets are classified into one of the following | 466 Graphic character sets are classified into one of the following |
| 469 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, | 467 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, |
| 563 | 561 |
| 564 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 562 Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 565 '(' can be omitted. We call this as "short-form" here after. | 563 '(' can be omitted. We call this as "short-form" here after. |
| 566 | 564 |
| 567 Now you may notice that there are a lot of ways for encoding the | 565 Now you may notice that there are a lot of ways for encoding the |
| 568 same multilingual text in ISO2022. Actually, there exist many | 566 same multilingual text in ISO2022. Actually, there exists many |
| 569 coding systems such as Compound Text (used in X's inter client | 567 coding systems such as Compound Text (used in X's inter client |
| 570 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR | 568 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR |
| 571 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian | 569 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian |
| 572 localized platforms), and all of these are variants of ISO2022. | 570 localized platforms), and all of these are variants of ISO2022. |
| 573 | 571 |
| 1016 } | 1014 } |
| 1017 *consumed = src - source; | 1015 *consumed = src - source; |
| 1018 return dst - destination; | 1016 return dst - destination; |
| 1019 } | 1017 } |
| 1020 | 1018 |
| 1021 /* ISO2022 encoding staffs. */ | 1019 /* ISO2022 encoding stuff. */ |
| 1022 | 1020 |
| 1023 /* | 1021 /* |
| 1024 It is not enough to say just "ISO2022" on encoding, but we have to | 1022 It is not enough to say just "ISO2022" on encoding, we have to |
| 1025 specify more details. In Emacs, each coding-system of ISO2022 | 1023 specify more details. In Emacs, each coding-system of ISO2022 |
| 1026 variant has the following specifications: | 1024 variant has the following specifications: |
| 1027 1. Initial designation to G0 thru G3. | 1025 1. Initial designation to G0 thru G3. |
| 1028 2. Allows short-form designation? | 1026 2. Allows short-form designation? |
| 1029 3. ASCII should be designated to G0 before control characters? | 1027 3. ASCII should be designated to G0 before control characters? |
| 1034 And the following two are only for Japanese: | 1032 And the following two are only for Japanese: |
| 1035 8. Use ASCII in place of JIS0201-1976-Roman? | 1033 8. Use ASCII in place of JIS0201-1976-Roman? |
| 1036 9. Use JISX0208-1983 in place of JISX0208-1978? | 1034 9. Use JISX0208-1983 in place of JISX0208-1978? |
| 1037 These specifications are encoded in `coding->flags' as flag bits | 1035 These specifications are encoded in `coding->flags' as flag bits |
| 1038 defined by macros CODING_FLAG_ISO_XXX. See `coding.h' for more | 1036 defined by macros CODING_FLAG_ISO_XXX. See `coding.h' for more |
| 1039 detail. | 1037 details. |
| 1040 */ | 1038 */ |
| 1041 | 1039 |
| 1042 /* Produce codes (escape sequence) for designating CHARSET to graphic | 1040 /* Produce codes (escape sequence) for designating CHARSET to graphic |
| 1043 register REG. If <final-char> of CHARSET is '@', 'A', or 'B' and | 1041 register REG. If <final-char> of CHARSET is '@', 'A', or 'B' and |
| 1044 the coding system CODING allows, produce designation sequence of | 1042 the coding system CODING allows, produce designation sequence of |
| 1130 do { \ | 1128 do { \ |
| 1131 *dst++ = ISO_CODE_ESC, *dst++ = 'o'; \ | 1129 *dst++ = ISO_CODE_ESC, *dst++ = 'o'; \ |
| 1132 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3; \ | 1130 CODING_SPEC_ISO_INVOCATION (coding, 0) = 3; \ |
| 1133 } while (0) | 1131 } while (0) |
| 1134 | 1132 |
| 1135 /* Produce codes for a DIMENSION1 character of which character set is | 1133 /* Produce codes for a DIMENSION1 character whose character set is |
| 1136 CHARSET and position-code is C1. Designation and invocation | 1134 CHARSET and whose position-code is C1. Designation and invocation |
| 1137 sequences are also produced in advance if necessary. */ | 1135 sequences are also produced in advance if necessary. */ |
| 1138 | 1136 |
| 1139 | 1137 |
| 1140 #define ENCODE_ISO_CHARACTER_DIMENSION1(charset, c1) \ | 1138 #define ENCODE_ISO_CHARACTER_DIMENSION1(charset, c1) \ |
| 1141 do { \ | 1139 do { \ |
| 1164 register. Then repeat the loop to actually produce the \ | 1162 register. Then repeat the loop to actually produce the \ |
| 1165 character. */ \ | 1163 character. */ \ |
| 1166 dst = encode_invocation_designation (charset, coding, dst); \ | 1164 dst = encode_invocation_designation (charset, coding, dst); \ |
| 1167 } while (1) | 1165 } while (1) |
| 1168 | 1166 |
| 1169 /* Produce codes for a DIMENSION2 character of which character set is | 1167 /* Produce codes for a DIMENSION2 character whose character set is |
| 1170 CHARSET and position-codes are C1 and C2. Designation and | 1168 CHARSET and whose position-codes are C1 and C2. Designation and |
| 1171 invocation codes are also produced in advance if necessary. */ | 1169 invocation codes are also produced in advance if necessary. */ |
| 1172 | 1170 |
| 1173 #define ENCODE_ISO_CHARACTER_DIMENSION2(charset, c1, c2) \ | 1171 #define ENCODE_ISO_CHARACTER_DIMENSION2(charset, c1, c2) \ |
| 1174 do { \ | 1172 do { \ |
| 1175 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \ | 1173 if (CODING_SPEC_ISO_SINGLE_SHIFTING (coding)) \ |
| 1550 } | 1548 } |
| 1551 | 1549 |
| 1552 | 1550 |
| 1553 /*** 4. SJIS and BIG5 handlers ***/ | 1551 /*** 4. SJIS and BIG5 handlers ***/ |
| 1554 | 1552 |
| 1555 /* Although SJIS and BIG5 are not ISO's coding system, They are used | 1553 /* Although SJIS and BIG5 are not ISO's coding system, they are used |
| 1556 quite widely. So, for the moment, Emacs supports them in the bare | 1554 quite widely. So, for the moment, Emacs supports them in the bare |
| 1557 C code. But, in the future, they may be supported only by CCL. */ | 1555 C code. But, in the future, they may be supported only by CCL. */ |
| 1558 | 1556 |
| 1559 /* SJIS is a coding system encoding three character sets: ASCII, right | 1557 /* SJIS is a coding system encoding three character sets: ASCII, right |
| 1560 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded | 1558 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded |
| 2165 Lisp_Object coding_system; | 2163 Lisp_Object coding_system; |
| 2166 struct coding_system *coding; | 2164 struct coding_system *coding; |
| 2167 { | 2165 { |
| 2168 Lisp_Object type, eol_type; | 2166 Lisp_Object type, eol_type; |
| 2169 | 2167 |
| 2170 /* At first, set several fields default values. */ | 2168 /* At first, set several fields to default values. */ |
| 2171 coding->require_flushing = 0; | 2169 coding->require_flushing = 0; |
| 2172 coding->last_block = 0; | 2170 coding->last_block = 0; |
| 2173 coding->selective = 0; | 2171 coding->selective = 0; |
| 2174 coding->composing = 0; | 2172 coding->composing = 0; |
| 2175 coding->direction = 0; | 2173 coding->direction = 0; |
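
The section 2 comment in this comparison gives the byte ranges of Emacs' internal format: 0x00 through 0x7F for ASCII and control characters, 0x80 through 0x9F for leading-codes, and 0xA0 through 0xFF for extended leading-codes and position-codes. The following is a minimal sketch of that classification, assuming only those stated ranges. The names `byte_class`, `classify_byte`, and `count_chars` are hypothetical and are not taken from `coding.c` or `charset.h`.

```c
#include <assert.h>
#include <stddef.h>

/* Byte classes, following the ranges quoted in the section 2 comment.
   These names are illustrative only; they do not exist in Emacs.  */
enum byte_class
{
  BYTE_ASCII,                /* 0x00 .. 0x7F */
  BYTE_LEADING_CODE,         /* 0x80 .. 0x9F */
  BYTE_EXT_OR_POSITION       /* 0xA0 .. 0xFF */
};

/* Classify one byte of text in the internal format.  */
static enum byte_class
classify_byte (unsigned char b)
{
  if (b <= 0x7F)
    return BYTE_ASCII;
  if (b <= 0x9F)
    return BYTE_LEADING_CODE;
  return BYTE_EXT_OR_POSITION;
}

/* Count characters in N bytes at SRC, assuming that every character
   starts with a byte below 0xA0 (an ASCII byte or a leading-code) and
   that all of its following bytes are 0xA0 or above.  Under that
   assumption a composite sequence counts as one character, because
   the comment says its components use only bytes of 0xA0 or above.  */
static int
count_chars (const unsigned char *src, int n)
{
  int count = 0;
  int i;

  assert (src != NULL || n == 0);
  for (i = 0; i < n; i++)
    if (classify_byte (src[i]) != BYTE_EXT_OR_POSITION)
      count++;
  return count;
}
```

For example, under these assumptions the four bytes `'a'`, `0x81`, `0xC1`, `'b'` would count as three characters: two ASCII characters and one two-byte sequence. The real decoder derives the sequence length from the base leading-code, as the comment notes, so this sketch is only a way to read the range table, not a substitute for the macros in `charset.h`.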
