comparison src/coding.c @ 35053:e3e1ff3616fa
Commentary changes.
(detect_eol_type_in_2_octet_form): Declare arg big_endian_p.
| author | Dave Love <fx@gnu.org> |
|---|---|
| date | Thu, 04 Jan 2001 17:35:26 +0000 |
| parents | 8cd5e6ad71a2 |
| children | 36de5bf9969c |
| 35052:07b5f5fdb0ce | 35053:e3e1ff3616fa |
|---|---|
| 35 */ | 35 */ |
| 36 | 36 |
| 37 /*** 0. General comments ***/ | 37 /*** 0. General comments ***/ |
| 38 | 38 |
| 39 | 39 |
| 40 /*** GENERAL NOTE on CODING SYSTEM *** | 40 /*** GENERAL NOTE on CODING SYSTEMS *** |
| 41 | 41 |
| 42 Coding system is an encoding mechanism of one or more character | 42 A coding system is an encoding mechanism for one or more character |
| 43 sets. Here's a list of coding systems which Emacs can handle. When | 43 sets. Here's a list of coding systems which Emacs can handle. When |
| 44 we say "decode", it means converting some other coding system to | 44 we say "decode", it means converting some other coding system to |
| 45 Emacs' internal format (emacs-internal), and when we say "encode", | 45 Emacs' internal format (emacs-mule), and when we say "encode", |
| 46 it means converting the coding system emacs-mule to some other | 46 it means converting the coding system emacs-mule to some other |
| 47 coding system. | 47 coding system. |
| 48 | 48 |
| 49 0. Emacs' internal format (emacs-mule) | 49 0. Emacs' internal format (emacs-mule) |
| 50 | 50 |
| 51 Emacs itself holds a multi-lingual character in a buffer and a string | 51 Emacs itself holds a multi-lingual character in buffers and strings |
| 52 in a special format. Details are described in section 2. | 52 in a special format. Details are described in section 2. |
| 53 | 53 |
| 54 1. ISO2022 | 54 1. ISO2022 |
| 55 | 55 |
| 56 The most famous coding system for multiple character sets. X's | 56 The most famous coding system for multiple character sets. X's |
| 64 JISX0208. Widely used for PC's in Japan. Details are described in | 64 JISX0208. Widely used for PC's in Japan. Details are described in |
| 65 section 4. | 65 section 4. |
| 66 | 66 |
| 67 3. BIG5 | 67 3. BIG5 |
| 68 | 68 |
| 69 A coding system to encode character sets: ASCII and Big5. Widely | 69 A coding system to encode the character sets ASCII and Big5. Widely |
| 70 used by Chinese (mainly in Taiwan and Hong Kong). Details are | 70 used for Chinese (mainly in Taiwan and Hong Kong). Details are |
| 71 described in section 4. In this file, when we write "BIG5" | 71 described in section 4. In this file, when we write "BIG5" |
| 72 (all uppercase), we mean the coding system, and when we write | 72 (all uppercase), we mean the coding system, and when we write |
| 73 "Big5" (capitalized), we mean the character set. | 73 "Big5" (capitalized), we mean the character set. |
| 74 | 74 |
| 75 4. Raw text | 75 4. Raw text |
| 76 | 76 |
| 77 A coding system for a text containing random 8-bit code. Emacs does | 77 A coding system for text containing random 8-bit code. Emacs does |
| 78 no code conversion on such a text except for end-of-line format. | 78 no code conversion on such text except for end-of-line format. |
| 79 | 79 |
| 80 5. Other | 80 5. Other |
| 81 | 81 |
| 82 If a user wants to read/write a text encoded in a coding system not | 82 If a user wants to read/write text encoded in a coding system not |
| 83 listed above, he can supply a decoder and an encoder for it in CCL | 83 listed above, he can supply a decoder and an encoder for it as CCL |
| 84 (Code Conversion Language) programs. Emacs executes the CCL program | 84 (Code Conversion Language) programs. Emacs executes the CCL program |
| 85 while reading/writing. | 85 while reading/writing. |
| 86 | 86 |
| 87 Emacs represents a coding system by a Lisp symbol that has a property | 87 Emacs represents a coding system by a Lisp symbol that has a property |
| 88 `coding-system'. But, before actually using the coding system, the | 88 `coding-system'. But, before actually using the coding system, the |
| 91 | 91 |
| 92 */ | 92 */ |
| 93 | 93 |
| 94 /*** GENERAL NOTES on END-OF-LINE FORMAT *** | 94 /*** GENERAL NOTES on END-OF-LINE FORMAT *** |
| 95 | 95 |
| 96 How end-of-line of a text is encoded depends on a system. For | 96 How end-of-line of text is encoded depends on the operating system. |
| 97 instance, Unix's format is just one byte of `line-feed' code, | 97 For instance, Unix's format is just one byte of `line-feed' code, |
| 98 whereas DOS's format is two-byte sequence of `carriage-return' and | 98 whereas DOS's format is two-byte sequence of `carriage-return' and |
| 99 `line-feed' codes. MacOS's format is usually one byte of | 99 `line-feed' codes. MacOS's format is usually one byte of |
| 100 `carriage-return'. | 100 `carriage-return'. |
| 101 | 101 |
| 102 Since text characters encoding and end-of-line encoding are | 102 Since text character encoding and end-of-line encoding are |
| 103 independent, any coding system described above can take | 103 independent, any coding system described above can have any |
| 104 any format of end-of-line. So, Emacs has information of format of | 104 end-of-line format. So Emacs has information about end-of-line |
| 105 end-of-line in each coding-system. See section 6 for more details. | 105 format in each coding-system. See section 6 for more details. |
| 106 | 106 |
| 107 */ | 107 */ |
| 108 | 108 |
| 109 /*** GENERAL NOTES on `detect_coding_XXX ()' functions *** | 109 /*** GENERAL NOTES on `detect_coding_XXX ()' functions *** |
| 110 | 110 |
| 111 These functions check if a text between SRC and SRC_END is encoded | 111 These functions check if a text between SRC and SRC_END is encoded |
| 112 in the coding system category XXX. Each returns an integer value in | 112 in the coding system category XXX. Each returns an integer value in |
| 113 which appropriate flag bits for the category XXX is set. The flag | 113 which appropriate flag bits for the category XXX are set. The flag |
| 114 bits are defined in macros CODING_CATEGORY_MASK_XXX. Below is the | 114 bits are defined in macros CODING_CATEGORY_MASK_XXX. Below is the |
| 115 template of these functions. If MULTIBYTEP is nonzero, 8-bit codes | 115 template for these functions. If MULTIBYTEP is nonzero, 8-bit codes |
| 116 of the range 0x80..0x9F are in multibyte form. */ | 116 of the range 0x80..0x9F are in multibyte form. */ |
| 117 #if 0 | 117 #if 0 |
| 118 int | 118 int |
| 119 detect_coding_emacs_mule (src, src_end, multibytep) | 119 detect_coding_emacs_mule (src, src_end, multibytep) |
| 120 unsigned char *src, *src_end; | 120 unsigned char *src, *src_end; |
| 129 These functions decode SRC_BYTES length of unibyte text at SOURCE | 129 These functions decode SRC_BYTES length of unibyte text at SOURCE |
| 130 encoded in CODING to Emacs' internal format. The resulting | 130 encoded in CODING to Emacs' internal format. The resulting |
| 131 multibyte text goes to a place pointed to by DESTINATION, the length | 131 multibyte text goes to a place pointed to by DESTINATION, the length |
| 132 of which should not exceed DST_BYTES. | 132 of which should not exceed DST_BYTES. |
| 133 | 133 |
| 134 These functions set the information of original and decoded texts in | 134 These functions set the information about original and decoded texts |
| 135 the members produced, produced_char, consumed, and consumed_char of | 135 in the members `produced', `produced_char', `consumed', and |
| 136 the structure *CODING. They also set the member result to one of | 136 `consumed_char' of the structure *CODING. They also set the member |
| 137 CODING_FINISH_XXX indicating how the decoding finished. | 137 `result' to one of CODING_FINISH_XXX indicating how the decoding |
| 138 | 138 finished. |
| 139 DST_BYTES zero means that source area and destination area are | 139 |
| 140 DST_BYTES zero means that the source area and destination area are | |
| 140 overlapped, which means that we can produce a decoded text until it | 141 overlapped, which means that we can produce a decoded text until it |
| 141 reaches at the head of not-yet-decoded source text. | 142 reaches the head of the not-yet-decoded source text. |
| 142 | 143 |
| 143 Below is a template of these functions. */ | 144 Below is a template for these functions. */ |
| 144 #if 0 | 145 #if 0 |
| 145 static void | 146 static void |
| 146 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes) | 147 decode_coding_XXX (coding, source, destination, src_bytes, dst_bytes) |
| 147 struct coding_system *coding; | 148 struct coding_system *coding; |
| 148 unsigned char *source, *destination; | 149 unsigned char *source, *destination; |
| 152 } | 153 } |
| 153 #endif | 154 #endif |
| 154 | 155 |
| 155 /*** GENERAL NOTES on `encode_coding_XXX ()' functions *** | 156 /*** GENERAL NOTES on `encode_coding_XXX ()' functions *** |
| 156 | 157 |
| 157 These functions encode SRC_BYTES length text at SOURCE of Emacs' | 158 These functions encode SRC_BYTES length text at SOURCE from Emacs' |
| 158 internal multibyte format to CODING. The resulting unibyte text | 159 internal multibyte format to CODING. The resulting unibyte text |
| 159 goes to a place pointed to by DESTINATION, the length of which | 160 goes to a place pointed to by DESTINATION, the length of which |
| 160 should not exceed DST_BYTES. | 161 should not exceed DST_BYTES. |
| 161 | 162 |
| 162 These functions set the information of original and encoded texts in | 163 These functions set the information about original and encoded texts |
| 163 the members produced, produced_char, consumed, and consumed_char of | 164 in the members `produced', `produced_char', `consumed', and |
| 164 the structure *CODING. They also set the member result to one of | 165 `consumed_char' of the structure *CODING. They also set the member |
| 165 CODING_FINISH_XXX indicating how the encoding finished. | 166 `result' to one of CODING_FINISH_XXX indicating how the encoding |
| 166 | 167 finished. |
| 167 DST_BYTES zero means that source area and destination area are | 168 |
| 168 overlapped, which means that we can produce a encoded text until it | 169 DST_BYTES zero means that the source area and destination area are |
| 169 reaches at the head of not-yet-encoded source text. | 170 overlapped, which means that we can produce encoded text until it |
| 170 | 171 reaches at the head of the not-yet-encoded source text. |
| 171 Below is a template of these functions. */ | 172 |
| 173 Below is a template for these functions. */ | |
| 172 #if 0 | 174 #if 0 |
| 173 static void | 175 static void |
| 174 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes) | 176 encode_coding_XXX (coding, source, destination, src_bytes, dst_bytes) |
| 175 struct coding_system *coding; | 177 struct coding_system *coding; |
| 176 unsigned char *source, *destination; | 178 unsigned char *source, *destination; |
| 258 | 260 |
| 259 | 261 |
| 260 /* Produce a multibyte form of characater C to `dst'. Jump to | 262 /* Produce a multibyte form of characater C to `dst'. Jump to |
| 261 `label_end_of_loop' if there's not enough space at `dst'. | 263 `label_end_of_loop' if there's not enough space at `dst'. |
| 262 | 264 |
| 263 If we are now in the middle of composition sequence, the decoded | 265 If we are now in the middle of a composition sequence, the decoded |
| 264 character may be ALTCHAR (for the current composition). In that | 266 character may be ALTCHAR (for the current composition). In that |
| 265 case, the character goes to coding->cmp_data->data instead of | 267 case, the character goes to coding->cmp_data->data instead of |
| 266 `dst'. | 268 `dst'. |
| 267 | 269 |
| 268 This macro is used in decoding routines. */ | 270 This macro is used in decoding routines. */ |
| 1123 | 1125 |
| 1124 /*** 3. ISO2022 handlers ***/ | 1126 /*** 3. ISO2022 handlers ***/ |
| 1125 | 1127 |
| 1126 /* The following note describes the coding system ISO2022 briefly. | 1128 /* The following note describes the coding system ISO2022 briefly. |
| 1127 Since the intention of this note is to help understand the | 1129 Since the intention of this note is to help understand the |
| 1128 functions in this file, some parts are NOT ACCURATE or OVERLY | 1130 functions in this file, some parts are NOT ACCURATE or are OVERLY |
| 1129 SIMPLIFIED. For thorough understanding, please refer to the | 1131 SIMPLIFIED. For thorough understanding, please refer to the |
| 1130 original document of ISO2022. | 1132 original document of ISO2022. This is equivalent to the standard |
| 1133 ECMA-35, obtainable from <URL:http://www.ecma.ch/> (*). | |
| 1131 | 1134 |
| 1132 ISO2022 provides many mechanisms to encode several character sets | 1135 ISO2022 provides many mechanisms to encode several character sets |
| 1133 in 7-bit and 8-bit environments. For 7-bite environments, all text | 1136 in 7-bit and 8-bit environments. For 7-bit environments, all text |
| 1134 is encoded using bytes less than 128. This may make the encoded | 1137 is encoded using bytes less than 128. This may make the encoded |
| 1135 text a little bit longer, but the text passes more easily through | 1138 text a little bit longer, but the text passes more easily through |
| 1136 several gateways, some of which strip off MSB (Most Signigant Bit). | 1139 several types of gateway, some of which strip off the MSB (Most |
| 1137 | 1140 Signigant Bit). |
| 1138 There are two kinds of character sets: control character set and | 1141 |
| 1139 graphic character set. The former contains control characters such | 1142 There are two kinds of character sets: control character sets and |
| 1143 graphic character sets. The former contain control characters such | |
| 1140 as `newline' and `escape' to provide control functions (control | 1144 as `newline' and `escape' to provide control functions (control |
| 1141 functions are also provided by escape sequences). The latter | 1145 functions are also provided by escape sequences). The latter |
| 1142 contains graphic characters such as 'A' and '-'. Emacs recognizes | 1146 contain graphic characters such as 'A' and '-'. Emacs recognizes |
| 1143 two control character sets and many graphic character sets. | 1147 two control character sets and many graphic character sets. |
| 1144 | 1148 |
| 1145 Graphic character sets are classified into one of the following | 1149 Graphic character sets are classified into one of the following |
| 1146 four classes, according to the number of bytes (DIMENSION) and | 1150 four classes, according to the number of bytes (DIMENSION) and |
| 1147 number of characters in one dimension (CHARS) of the set: | 1151 number of characters in one dimension (CHARS) of the set: |
| 1149 - DIMENSION1_CHARS96 | 1153 - DIMENSION1_CHARS96 |
| 1150 - DIMENSION2_CHARS94 | 1154 - DIMENSION2_CHARS94 |
| 1151 - DIMENSION2_CHARS96 | 1155 - DIMENSION2_CHARS96 |
| 1152 | 1156 |
| 1153 In addition, each character set is assigned an identification tag, | 1157 In addition, each character set is assigned an identification tag, |
| 1154 unique for each set, called "final character" (denoted as <F> | 1158 unique for each set, called the "final character" (denoted as <F> |
| 1155 hereafter). The <F> of each character set is decided by ECMA(*) | 1159 hereafter). The <F> of each character set is decided by ECMA(*) |
| 1156 when it is registered in ISO. The code range of <F> is 0x30..0x7F | 1160 when it is registered in ISO. The code range of <F> is 0x30..0x7F |
| 1157 (0x30..0x3F are for private use only). | 1161 (0x30..0x3F are for private use only). |
| 1158 | 1162 |
| 1159 Note (*): ECMA = European Computer Manufacturers Association | 1163 Note (*): ECMA = European Computer Manufacturers Association |
| 1160 | 1164 |
| 1161 Here are examples of graphic character set [NAME(<F>)]: | 1165 Here are examples of graphic character sets [NAME(<F>)]: |
| 1162 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | 1166 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... |
| 1163 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | 1167 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... |
| 1164 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 1168 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
| 1165 o DIMENSION2_CHARS96 -- none for the moment | 1169 o DIMENSION2_CHARS96 -- none for the moment |
| 1166 | 1170 |
| 1249 7-bit environment, non-locking-shift, and non-single-shift. | 1253 7-bit environment, non-locking-shift, and non-single-shift. |
| 1250 | 1254 |
| 1251 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 1255 Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 1252 '(' can be omitted. We refer to this as "short-form" hereafter. | 1256 '(' can be omitted. We refer to this as "short-form" hereafter. |
| 1253 | 1257 |
| 1254 Now you may notice that there are a lot of ways for encoding the | 1258 Now you may notice that there are a lot of ways of encoding the |
| 1255 same multilingual text in ISO2022. Actually, there exist many | 1259 same multilingual text in ISO2022. Actually, there exist many |
| 1256 coding systems such as Compound Text (used in X11's inter client | 1260 coding systems such as Compound Text (used in X11's inter client |
| 1257 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR | 1261 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR |
| 1258 (used in Korean internet), EUC (Extended UNIX Code, used in Asian | 1262 (used in Korean internet), EUC (Extended UNIX Code, used in Asian |
| 1259 localized platforms), and all of these are variants of ISO2022. | 1263 localized platforms), and all of these are variants of ISO2022. |
| 1275 o ESC '1' -- end composition | 1279 o ESC '1' -- end composition |
| 1276 o ESC '2' -- start rule-base composition (*) | 1280 o ESC '2' -- start rule-base composition (*) |
| 1277 o ESC '3' -- start relative composition with alternate chars (**) | 1281 o ESC '3' -- start relative composition with alternate chars (**) |
| 1278 o ESC '4' -- start rule-base composition with alternate chars (**) | 1282 o ESC '4' -- start rule-base composition with alternate chars (**) |
| 1279 Since these are not standard escape sequences of any ISO standard, | 1283 Since these are not standard escape sequences of any ISO standard, |
| 1280 the use of them for these meaning is restricted to Emacs only. | 1284 the use of them with these meanings is restricted to Emacs only. |
| 1281 | 1285 |
| 1282 (*) This form is used only in Emacs 20.5 and the older versions, | 1286 (*) This form is used only in Emacs 20.5 and older versions, |
| 1283 but the newer versions can safely decode it. | 1287 but the newer versions can safely decode it. |
| 1284 (**) This form is used only in Emacs 21.1 and the newer versions, | 1288 (**) This form is used only in Emacs 21.1 and newer versions, |
| 1285 and the older versions can't decode it. | 1289 and the older versions can't decode it. |
| 1286 | 1290 |
| 1287 Here's a list of examples usages of these composition escape | 1291 Here's a list of example usages of these composition escape |
| 1288 sequences (categorized by `enum composition_method'). | 1292 sequences (categorized by `enum composition_method'). |
| 1289 | 1293 |
| 1290 COMPOSITION_RELATIVE: | 1294 COMPOSITION_RELATIVE: |
| 1291 ESC 0 CHAR [ CHAR ] ESC 1 | 1295 ESC 0 CHAR [ CHAR ] ESC 1 |
| 1292 COMPOSITOIN_WITH_RULE: | 1296 COMPOSITOIN_WITH_RULE: |
| 1309 | 1313 |
| 1310 #define SHIFT_OUT_OK(idx) \ | 1314 #define SHIFT_OUT_OK(idx) \ |
| 1311 (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding_system_table[idx], 1) >= 0) | 1315 (CODING_SPEC_ISO_INITIAL_DESIGNATION (coding_system_table[idx], 1) >= 0) |
| 1312 | 1316 |
| 1313 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions". | 1317 /* See the above "GENERAL NOTES on `detect_coding_XXX ()' functions". |
| 1314 Check if a text is encoded in ISO2022. If it is, returns an | 1318 Check if a text is encoded in ISO2022. If it is, return an |
| 1315 integer in which appropriate flag bits any of: | 1319 integer in which appropriate flag bits any of: |
| 1316 CODING_CATEGORY_MASK_ISO_7 | 1320 CODING_CATEGORY_MASK_ISO_7 |
| 1317 CODING_CATEGORY_MASK_ISO_7_TIGHT | 1321 CODING_CATEGORY_MASK_ISO_7_TIGHT |
| 1318 CODING_CATEGORY_MASK_ISO_8_1 | 1322 CODING_CATEGORY_MASK_ISO_8_1 |
| 1319 CODING_CATEGORY_MASK_ISO_8_2 | 1323 CODING_CATEGORY_MASK_ISO_8_2 |
| 2038 | 2042 |
| 2039 /* ISO2022 encoding stuff. */ | 2043 /* ISO2022 encoding stuff. */ |
| 2040 | 2044 |
| 2041 /* | 2045 /* |
| 2042 It is not enough to say just "ISO2022" on encoding, we have to | 2046 It is not enough to say just "ISO2022" on encoding, we have to |
| 2043 specify more details. In Emacs, each coding system of ISO2022 | 2047 specify more details. In Emacs, each ISO2022 coding system |
| 2044 variant has the following specifications: | 2048 variant has the following specifications: |
| 2045 1. Initial designation to G0 thru G3. | 2049 1. Initial designation to G0 thru G3. |
| 2046 2. Allows short-form designation? | 2050 2. Allows short-form designation? |
| 2047 3. ASCII should be designated to G0 before control characters? | 2051 3. ASCII should be designated to G0 before control characters? |
| 2048 4. ASCII should be designated to G0 at end of line? | 2052 4. ASCII should be designated to G0 at end of line? |
| 2633 } | 2637 } |
| 2634 | 2638 |
| 2635 | 2639 |
| 2636 /*** 4. SJIS and BIG5 handlers ***/ | 2640 /*** 4. SJIS and BIG5 handlers ***/ |
| 2637 | 2641 |
| 2638 /* Although SJIS and BIG5 are not ISO's coding system, they are used | 2642 /* Although SJIS and BIG5 are not ISO coding systems, they are used |
| 2639 quite widely. So, for the moment, Emacs supports them in the bare | 2643 quite widely. So, for the moment, Emacs supports them in the bare |
| 2640 C code. But, in the future, they may be supported only by CCL. */ | 2644 C code. But, in the future, they may be supported only by CCL. */ |
| 2641 | 2645 |
| 2642 /* SJIS is a coding system encoding three character sets: ASCII, right | 2646 /* SJIS is a coding system encoding three character sets: ASCII, right |
| 2643 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded | 2647 half of JISX0201-Kana, and JISX0208. An ASCII character is encoded |
| 2644 as is. A character of charset katakana-jisx0201 is encoded by | 2648 as is. A character of charset katakana-jisx0201 is encoded by |
| 2645 "position-code + 0x80". A character of charset japanese-jisx0208 | 2649 "position-code + 0x80". A character of charset japanese-jisx0208 |
| 2646 is encoded in 2-byte but two position-codes are divided and shifted | 2650 is encoded in 2-byte but two position-codes are divided and shifted |
| 2647 so that it fit in the range below. | 2651 so that it fits in the range below. |
| 2648 | 2652 |
| 2649 --- CODE RANGE of SJIS --- | 2653 --- CODE RANGE of SJIS --- |
| 2650 (character set) (range) | 2654 (character set) (range) |
| 2651 ASCII 0x00 .. 0x7F | 2655 ASCII 0x00 .. 0x7F |
| 2652 KATAKANA-JISX0201 0xA0 .. 0xDF | 2656 KATAKANA-JISX0201 0xA0 .. 0xDF |
| 2656 | 2660 |
| 2657 */ | 2661 */ |
| 2658 | 2662 |
| 2659 /* BIG5 is a coding system encoding two character sets: ASCII and | 2663 /* BIG5 is a coding system encoding two character sets: ASCII and |
| 2660 Big5. An ASCII character is encoded as is. Big5 is a two-byte | 2664 Big5. An ASCII character is encoded as is. Big5 is a two-byte |
| 2661 character set and is encoded in two-byte. | 2665 character set and is encoded in two bytes. |
| 2662 | 2666 |
| 2663 --- CODE RANGE of BIG5 --- | 2667 --- CODE RANGE of BIG5 --- |
| 2664 (character set) (range) | 2668 (character set) (range) |
| 2665 ASCII 0x00 .. 0x7F | 2669 ASCII 0x00 .. 0x7F |
| 2666 Big5 (1st byte) 0xA1 .. 0xFE | 2670 Big5 (1st byte) 0xA1 .. 0xFE |
| 3308 } | 3312 } |
| 3309 | 3313 |
| 3310 | 3314 |
| 3311 /*** 7. C library functions ***/ | 3315 /*** 7. C library functions ***/ |
| 3312 | 3316 |
| 3313 /* In Emacs Lisp, coding system is represented by a Lisp symbol which | 3317 /* In Emacs Lisp, a coding system is represented by a Lisp symbol which |
| 3314 has a property `coding-system'. The value of this property is a | 3318 has a property `coding-system'. The value of this property is a |
| 3315 vector of length 5 (called as coding-vector). Among elements of | 3319 vector of length 5 (called the coding-vector). Among elements of |
| 3316 this vector, the first (element[0]) and the fifth (element[4]) | 3320 this vector, the first (element[0]) and the fifth (element[4]) |
| 3317 carry important information for decoding/encoding. Before | 3321 carry important information for decoding/encoding. Before |
| 3318 decoding/encoding, this information should be set in fields of a | 3322 decoding/encoding, this information should be set in fields of a |
| 3319 structure of type `coding_system'. | 3323 structure of type `coding_system'. |
| 3320 | 3324 |
| 3321 A value of property `coding-system' can be a symbol of another | 3325 The value of the property `coding-system' can be a symbol of another |
| 3322 subsidiary coding-system. In that case, Emacs gets coding-vector | 3326 subsidiary coding-system. In that case, Emacs gets coding-vector |
| 3323 from that symbol. | 3327 from that symbol. |
| 3324 | 3328 |
| 3325 `element[0]' contains information to be set in `coding->type'. The | 3329 `element[0]' contains information to be set in `coding->type'. The |
| 3326 value and its meaning is as follows: | 3330 value and its meaning is as follows: |
| 3360 If `coding->type' is `coding_type_big5', element[4] is t to denote | 3364 If `coding->type' is `coding_type_big5', element[4] is t to denote |
| 3361 BIG5-ETen or nil to denote BIG5-HKU. | 3365 BIG5-ETen or nil to denote BIG5-HKU. |
| 3362 | 3366 |
| 3363 If `coding->type' takes the other value, element[4] is ignored. | 3367 If `coding->type' takes the other value, element[4] is ignored. |
| 3364 | 3368 |
| 3365 Emacs Lisp's coding system also carries information about format of | 3369 Emacs Lisp's coding systems also carry information about format of |
| 3366 end-of-line in a value of property `eol-type'. If the value is | 3370 end-of-line in a value of property `eol-type'. If the value is |
| 3367 integer, 0 means CODING_EOL_LF, 1 means CODING_EOL_CRLF, and 2 | 3371 integer, 0 means CODING_EOL_LF, 1 means CODING_EOL_CRLF, and 2 |
| 3368 means CODING_EOL_CR. If it is not integer, it should be a vector | 3372 means CODING_EOL_CR. If it is not integer, it should be a vector |
| 3369 of subsidiary coding systems of which property `eol-type' has one | 3373 of subsidiary coding systems of which property `eol-type' has one |
| 3370 of above values. | 3374 of the above values. |
| 3371 | 3375 |
| 3372 */ | 3376 */ |
| 3373 | 3377 |
| 3374 /* Extract information for decoding/encoding from CODING_SYSTEM_SYMBOL | 3378 /* Extract information for decoding/encoding from CODING_SYSTEM_SYMBOL |
| 3375 and set it in CODING. If CODING_SYSTEM_SYMBOL is invalid, CODING | 3379 and set it in CODING. If CODING_SYSTEM_SYMBOL is invalid, CODING |
| 3893 The category for a coding system not categorized in any of the | 3897 The category for a coding system not categorized in any of the |
| 3894 above. Assigned the coding-system (Lisp symbol) | 3898 above. Assigned the coding-system (Lisp symbol) |
| 3895 `no-conversion' by default. | 3899 `no-conversion' by default. |
| 3896 | 3900 |
| 3897 Each of them is a Lisp symbol and the value is an actual | 3901 Each of them is a Lisp symbol and the value is an actual |
| 3898 `coding-system's (this is also a Lisp symbol) assigned by a user. | 3902 `coding-system' (this is also a Lisp symbol) assigned by a user. |
| 3899 What Emacs does actually is to detect a category of coding system. | 3903 What Emacs does actually is to detect a category of coding system. |
| 3900 Then, it uses a `coding-system' assigned to it. If Emacs can't | 3904 Then, it uses a `coding-system' assigned to it. If Emacs can't |
| 3901 decide only one possible category, it selects a category of the | 3905 decide a single possible category, it selects a category of the |
| 3902 highest priority. Priorities of categories are also specified by a | 3906 highest priority. Priorities of categories are also specified by a |
| 3903 user in a Lisp variable `coding-category-list'. | 3907 user in a Lisp variable `coding-category-list'. |
| 3904 | 3908 |
| 3905 */ | 3909 */ |
| 3906 | 3910 |
| 4186 utf-16-le. */ | 4190 utf-16-le. */ |
| 4187 | 4191 |
| 4188 static int | 4192 static int |
| 4189 detect_eol_type_in_2_octet_form (source, src_bytes, skip, big_endian_p) | 4193 detect_eol_type_in_2_octet_form (source, src_bytes, skip, big_endian_p) |
| 4190 unsigned char *source; | 4194 unsigned char *source; |
| 4191 int src_bytes, *skip; | 4195 int src_bytes, *skip, big_endian_p; |
| 4192 { | 4196 { |
| 4193 unsigned char *src = source, *src_end = src + src_bytes; | 4197 unsigned char *src = source, *src_end = src + src_bytes; |
| 4194 unsigned int c1, c2; | 4198 unsigned int c1, c2; |
| 4195 int total = 0; /* How many end-of-lines are found so far. */ | 4199 int total = 0; /* How many end-of-lines are found so far. */ |
| 4196 int eol_type = CODING_EOL_UNDECIDED; | 4200 int eol_type = CODING_EOL_UNDECIDED; |
| 6404 return make_number (coding.produced_char); | 6408 return make_number (coding.produced_char); |
| 6405 } | 6409 } |
| 6406 | 6410 |
| 6407 DEFUN ("decode-coding-region", Fdecode_coding_region, Sdecode_coding_region, | 6411 DEFUN ("decode-coding-region", Fdecode_coding_region, Sdecode_coding_region, |
| 6408 3, 3, "r\nzCoding system: ", | 6412 3, 3, "r\nzCoding system: ", |
| 6409 "Decode the current region by specified coding system.\n\ | 6413 "Decode the current region from the specified coding system.\n\ |
| 6410 When called from a program, takes three arguments:\n\ | 6414 When called from a program, takes three arguments:\n\ |
| 6411 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\ | 6415 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\ |
| 6412 This function sets `last-coding-system-used' to the precise coding system\n\ | 6416 This function sets `last-coding-system-used' to the precise coding system\n\ |
| 6413 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is\n\ | 6417 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is\n\ |
| 6414 not fully specified.)\n\ | 6418 not fully specified.)\n\ |
| 6419 return code_convert_region1 (start, end, coding_system, 0); | 6423 return code_convert_region1 (start, end, coding_system, 0); |
| 6420 } | 6424 } |
| 6421 | 6425 |
| 6422 DEFUN ("encode-coding-region", Fencode_coding_region, Sencode_coding_region, | 6426 DEFUN ("encode-coding-region", Fencode_coding_region, Sencode_coding_region, |
| 6423 3, 3, "r\nzCoding system: ", | 6427 3, 3, "r\nzCoding system: ", |
| 6424 "Encode the current region by specified coding system.\n\ | 6428 "Encode the current region into the specified coding system.\n\ |
| 6425 When called from a program, takes three arguments:\n\ | 6429 When called from a program, takes three arguments:\n\ |
| 6426 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\ | 6430 START, END, and CODING-SYSTEM. START and END are buffer positions.\n\ |
| 6427 This function sets `last-coding-system-used' to the precise coding system\n\ | 6431 This function sets `last-coding-system-used' to the precise coding system\n\ |
| 6428 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is\n\ | 6432 used (which may be different from CODING-SYSTEM if CODING-SYSTEM is\n\ |
| 6429 not fully specified.)\n\ | 6433 not fully specified.)\n\ |
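
The end-of-line note in this changeset describes three formats: one `line-feed' byte (Unix), `carriage-return' followed by `line-feed' (DOS), and one `carriage-return' byte (MacOS). As a rough illustration of how such detection can work on plain one-octet text, here is a minimal sketch. It is not the code from coding.c: the function name `sketch_detect_eol` and the enum names are hypothetical, and only the values 0, 1 and 2 for LF, CRLF and CR come from the commentary above; the values chosen for "undecided" and "inconsistent" are assumptions.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the CODING_EOL_XXX constants.  The
   commentary gives 0 = LF, 1 = CRLF, 2 = CR; the last two values are
   assumptions made for this sketch only.  */
enum sketch_eol
{
  SKETCH_EOL_LF = 0,
  SKETCH_EOL_CRLF = 1,
  SKETCH_EOL_CR = 2,
  SKETCH_EOL_UNDECIDED = 3,
  SKETCH_EOL_INCONSISTENT = 4
};

/* Scan LEN bytes at SRC and report which end-of-line format is used.
   Return SKETCH_EOL_UNDECIDED if no end-of-line is seen, and
   SKETCH_EOL_INCONSISTENT if two different formats are mixed.  */
enum sketch_eol
sketch_detect_eol (const unsigned char *src, size_t len)
{
  enum sketch_eol found = SKETCH_EOL_UNDECIDED;
  size_t i = 0;

  while (i < len)
    {
      enum sketch_eol this_eol;
      unsigned char c = src[i++];

      if (c == '\n')
        this_eol = SKETCH_EOL_LF;
      else if (c == '\r')
        {
          if (i < len && src[i] == '\n')
            {
              this_eol = SKETCH_EOL_CRLF;
              i++;              /* Consume the line-feed as well.  */
            }
          else
            this_eol = SKETCH_EOL_CR;
        }
      else
        continue;               /* Not an end-of-line byte.  */

      if (found == SKETCH_EOL_UNDECIDED)
        found = this_eol;
      else if (found != this_eol)
        return SKETCH_EOL_INCONSISTENT;
    }
  return found;
}
```

A two-octet variant such as `detect_eol_type_in_2_octet_form` would read a pair of bytes per character and use its `big_endian_p` argument to decide which byte of the pair is the high one; that part is not shown here.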

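The SJIS commentary in section 4 says that a japanese-jisx0208 character is encoded in two bytes, with the two position-codes "divided and shifted" to fit the SJIS code range. The sketch below shows the widely used Shift_JIS arithmetic for that step. The function name `sketch_jisx0208_to_sjis` is hypothetical, and the code is written from the general definition of Shift_JIS, not copied from this revision of coding.c.

```c
/* Convert the two JIS X 0208 position codes J1 and J2 (each in the
   range 0x21..0x7E) to the two Shift_JIS bytes *S1 and *S2.
   Return 0 on success, -1 if a position code is out of range.
   Hypothetical helper for illustration only.  */
int
sketch_jisx0208_to_sjis (unsigned int j1, unsigned int j2,
                         unsigned char *s1, unsigned char *s2)
{
  if (j1 < 0x21 || j1 > 0x7E || j2 < 0x21 || j2 > 0x7E)
    return -1;

  if (j1 & 1)
    {
      /* Odd rows use the lower half of the second-byte range,
         skipping 0x7F.  */
      *s1 = (unsigned char) (j1 / 2 + (j1 < 0x5F ? 0x71 : 0xB1));
      *s2 = (unsigned char) (j2 + (j2 >= 0x60 ? 0x20 : 0x1F));
    }
  else
    {
      /* Even rows use the upper half of the second-byte range.  */
      *s1 = (unsigned char) (j1 / 2 + (j1 < 0x5F ? 0x70 : 0xB0));
      *s2 = (unsigned char) (j2 + 0x7E);
    }
  return 0;
}
```

Two rows of JIS X 0208 share one SJIS first byte, which is why the first position code is halved, and the first byte skips the range 0xA0..0xDF that the commentary assigns to KATAKANA-JISX0201.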