comparison src/coding.c @ 24425:61c6b3be1d51
Comment for ISO 2022 encoding mechanism modified.
| author | Kenichi Handa <handa@m17n.org> |
|---|---|
| date | Mon, 01 Mar 1999 11:52:54 +0000 |
| parents | 8b7ef7fb9e2e |
| children | be35d27a4bfb |
| 24424:520e8f39c1f8 | 24425:61c6b3be1d51 |
|---|---|
| 523 | 523 |
| 524 | 524 |
| 525 /*** 3. ISO2022 handlers ***/ | 525 /*** 3. ISO2022 handlers ***/ |
| 526 | 526 |
| 527 /* The following note describes the coding system ISO2022 briefly. | 527 /* The following note describes the coding system ISO2022 briefly. |
| 528 Since the intention of this note is to help in understanding of | 528 Since the intention of this note is to help understand the |
| 529 the programs in this file, some parts are NOT ACCURATE or OVERLY | 529 functions in this file, some parts are NOT ACCURATE or OVERLY |
| 530 SIMPLIFIED. For the thorough understanding, please refer to the | 530 SIMPLIFIED. For thorough understanding, please refer to the |
| 531 original document of ISO2022. | 531 original document of ISO2022. |
| 532 | 532 |
| 533 ISO2022 provides many mechanisms to encode several character sets | 533 ISO2022 provides many mechanisms to encode several character sets |
| 534 in 7-bit and 8-bit environment. If one chooses 7-bit environment, | 534 in 7-bit and 8-bit environments. For 7-bit environments, all text |
| 535 all text is encoded by codes of less than 128. This may make the | 535 is encoded using bytes less than 128. This may make the encoded |
| 536 encoded text a little bit longer, but the text gets more stability | 536 text a little bit longer, but the text passes more easily through |
| 537 to pass through several gateways (some of them strip off the MSB). | 537 several gateways, some of which strip off MSB (Most Significant Bit). |
| 538 | 538 |
| 539 There are two kinds of character set: control character set and | 539 There are two kinds of character sets: control character set and |
| 540 graphic character set. The former contains control characters such | 540 graphic character set. The former contains control characters such |
| 541 as `newline' and `escape' to provide control functions (control | 541 as `newline' and `escape' to provide control functions (control |
| 542 functions are provided also by escape sequences). The latter | 542 functions are also provided by escape sequences). The latter |
| 543 contains graphic characters such as ' A' and '-'. Emacs recognizes | 543 contains graphic characters such as 'A' and '-'. Emacs recognizes |
| 544 two control character sets and many graphic character sets. | 544 two control character sets and many graphic character sets. |
| 545 | 545 |
| 546 Graphic character sets are classified into one of the following | 546 Graphic character sets are classified into one of the following |
| 547 four classes, DIMENSION1_CHARS94, DIMENSION1_CHARS96, | 547 four classes, according to the number of bytes (DIMENSION) and |
| 548 DIMENSION2_CHARS94, DIMENSION2_CHARS96 according to the number of | 548 number of characters in one dimension (CHARS) of the set: |
| 549 bytes (DIMENSION) and the number of characters in one dimension | 549 - DIMENSION1_CHARS94 |
| 550 (CHARS) of the set. In addition, each character set is assigned an | 550 - DIMENSION1_CHARS96 |
| 551 identification tag (called "final character" and denoted as <F> | 551 - DIMENSION2_CHARS94 |
| 552 here after) which is unique in each class. <F> of each character | 552 - DIMENSION2_CHARS96 |
| 553 set is decided by ECMA(*) when it is registered in ISO. Code range | 553 |
| 554 of <F> is 0x30..0x7F (0x30..0x3F are for private use only). | 554 In addition, each character set is assigned an identification tag, |
| 555 unique for each set, called "final character" (denoted as <F> | |
| 556 hereafter). The <F> of each character set is decided by ECMA(*) | |
| 557 when it is registered in ISO. The code range of <F> is 0x30..0x7F | |
| 558 (0x30..0x3F are for private use only). | |
| 555 | 559 |
| 556 Note (*): ECMA = European Computer Manufacturers Association | 560 Note (*): ECMA = European Computer Manufacturers Association |
| 557 | 561 |
| 558 Here are examples of graphic character set [NAME(<F>)]: | 562 Here are examples of graphic character set [NAME(<F>)]: |
| 559 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... | 563 o DIMENSION1_CHARS94 -- ASCII('B'), right-half-of-JISX0201('I'), ... |
| 560 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... | 564 o DIMENSION1_CHARS96 -- right-half-of-ISO8859-1('A'), ... |
| 561 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... | 565 o DIMENSION2_CHARS94 -- GB2312('A'), JISX0208('B'), ... |
| 562 o DIMENSION2_CHARS96 -- none for the moment | 566 o DIMENSION2_CHARS96 -- none for the moment |
| 563 | 567 |
| 564 A code area (1byte=8bits) is divided into 4 areas, C0, GL, C1, and GR. | 568 A code area (1 byte=8 bits) is divided into 4 areas, C0, GL, C1, and GR. |
| 565 C0 [0x00..0x1F] -- control character plane 0 | 569 C0 [0x00..0x1F] -- control character plane 0 |
| 566 GL [0x20..0x7F] -- graphic character plane 0 | 570 GL [0x20..0x7F] -- graphic character plane 0 |
| 567 C1 [0x80..0x9F] -- control character plane 1 | 571 C1 [0x80..0x9F] -- control character plane 1 |
| 568 GR [0xA0..0xFF] -- graphic character plane 1 | 572 GR [0xA0..0xFF] -- graphic character plane 1 |
| 569 | 573 |
| 570 A control character set is directly designated and invoked to C0 or | 574 A control character set is directly designated and invoked to C0 or |
| 571 C1 by an escape sequence. The most common case is that ISO646's | 575 C1 by an escape sequence. The most common case is that: |
| 572 control character set is designated/invoked to C0 and ISO6429's | 576 - ISO646's control character set is designated/invoked to C0, and |
| 573 control character set is designated/invoked to C1, and usually | 577 - ISO6429's control character set is designated/invoked to C1, |
| 574 these designations/invocations are omitted in a coded text. With | 578 and usually these designations/invocations are omitted in encoded |
| 575 7-bit environment, only C0 can be used, and a control character for | 579 text. In a 7-bit environment, only C0 can be used, and a control |
| 576 C1 is encoded by an appropriate escape sequence to fit in the | 580 character for C1 is encoded by an appropriate escape sequence to |
| 577 environment. All control characters for C1 are defined the | 581 fit into the environment. All control characters for C1 are |
| 578 corresponding escape sequences. | 582 defined to have corresponding escape sequences. |
| 579 | 583 |
| 580 A graphic character set is at first designated to one of four | 584 A graphic character set is at first designated to one of four |
| 581 graphic registers (G0 through G3), then these graphic registers are | 585 graphic registers (G0 through G3), then these graphic registers are |
| 582 invoked to GL or GR. These designations and invocations can be | 586 invoked to GL or GR. These designations and invocations can be |
| 583 done independently. The most common case is that G0 is invoked to | 587 done independently. The most common case is that G0 is invoked to |
| 584 GL, G1 is invoked to GR, and ASCII is designated to G0, and usually | 588 GL, G1 is invoked to GR, and ASCII is designated to G0. Usually |
| 585 these invocations and designations are omitted in a coded text. | 589 these invocations and designations are omitted in encoded text. |
| 586 With 7-bit environment, only GL can be used. | 590 In a 7-bit environment, only GL can be used. |
| 587 | 591 |
| 588 When a graphic character set of CHARS94 is invoked to GL, code 0x20 | 592 When a graphic character set of CHARS94 is invoked to GL, codes |
| 589 and 0x7F of GL area work as control characters SPACE and DEL | 593 0x20 and 0x7F of the GL area work as control characters SPACE and |
| 590 respectively, and code 0xA0 and 0xFF of GR area should not be used. | 594 DEL respectively, and codes 0xA0 and 0xFF of the GR area should not |
| 595 be used. | |
| 591 | 596 |
| 592 There are two ways of invocation: locking-shift and single-shift. | 597 There are two ways of invocation: locking-shift and single-shift. |
| 593 With locking-shift, the invocation lasts until the next different | 598 With locking-shift, the invocation lasts until the next different |
| 594 invocation, whereas with single-shift, the invocation works only | 599 invocation, whereas with single-shift, the invocation affects the |
| 595 for the following character and doesn't affect locking-shift. | 600 following character only and doesn't affect the locking-shift |
| 596 Invocations are done by the following control characters or escape | 601 state. Invocations are done by the following control characters or |
| 597 sequences. | 602 escape sequences: |
| 598 | 603 |
| 599 ---------------------------------------------------------------------- | 604 ---------------------------------------------------------------------- |
| 600 function control char escape sequence description | 605 abbrev function cntrl escape seq description |
| 601 ---------------------------------------------------------------------- | 606 ---------------------------------------------------------------------- |
| 602 SI (shift-in) 0x0F none invoke G0 to GL | 607 SI/LS0 (shift-in) 0x0F none invoke G0 into GL |
| 603 SO (shift-out) 0x0E none invoke G1 to GL | 608 SO/LS1 (shift-out) 0x0E none invoke G1 into GL |
| 604 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL | 609 LS2 (locking-shift-2) none ESC 'n' invoke G2 into GL |
| 605 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL | 610 LS3 (locking-shift-3) none ESC 'o' invoke G3 into GL |
| 606 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 into GL | 611 LS1R (locking-shift-1 right) none ESC '~' invoke G1 into GR (*) |
| 607 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 into GL | 612 LS2R (locking-shift-2 right) none ESC '}' invoke G2 into GR (*) |
| 613 LS3R (locking-shift 3 right) none ESC '|' invoke G3 into GR (*) | |
| 614 SS2 (single-shift-2) 0x8E ESC 'N' invoke G2 for one char | |
| 615 SS3 (single-shift-3) 0x8F ESC 'O' invoke G3 for one char | |
| 608 ---------------------------------------------------------------------- | 616 ---------------------------------------------------------------------- |
| 609 The first four are for locking-shift. Control characters for these | 617 (*) These are not used by any known coding system. |
| 610 functions are defined by macros ISO_CODE_XXX in `coding.h'. | 618 |
| 611 | 619 Control characters for these functions are defined by macros |
| 612 Designations are done by the following escape sequences. | 620 ISO_CODE_XXX in `coding.h'. |
| 621 | |
| 622 Designations are done by the following escape sequences: | |
| 613 ---------------------------------------------------------------------- | 623 ---------------------------------------------------------------------- |
| 614 escape sequence description | 624 escape sequence description |
| 615 ---------------------------------------------------------------------- | 625 ---------------------------------------------------------------------- |
| 616 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0 | 626 ESC '(' <F> designate DIMENSION1_CHARS94<F> to G0 |
| 617 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1 | 627 ESC ')' <F> designate DIMENSION1_CHARS94<F> to G1 |
| 630 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2 | 640 ESC '$' '.' <F> designate DIMENSION2_CHARS96<F> to G2 |
| 631 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3 | 641 ESC '$' '/' <F> designate DIMENSION2_CHARS96<F> to G3 |
| 632 ---------------------------------------------------------------------- | 642 ---------------------------------------------------------------------- |
| 633 | 643 |
| 634 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set | 644 In this list, "DIMENSION1_CHARS94<F>" means a graphic character set |
| 635 of dimension 1, chars 94, and final character <F>, and etc. | 645 of dimension 1, chars 94, and final character <F>, etc... |
| 636 | 646 |
| 637 Note (*): Although these designations are not allowed in ISO2022, | 647 Note (*): Although these designations are not allowed in ISO2022, |
| 638 Emacs accepts them on decoding, and produces them on encoding | 648 Emacs accepts them on decoding, and produces them on encoding |
| 639 CHARS96 character set in a coding system which is characterized as | 649 CHARS96 character sets in a coding system which is characterized as |
| 640 7-bit environment, non-locking-shift, and non-single-shift. | 650 7-bit environment, non-locking-shift, and non-single-shift. |
| 641 | 651 |
| 642 Note (**): If <F> is '@', 'A', or 'B', the intermediate character | 652 Note (**): If <F> is '@', 'A', or 'B', the intermediate character |
| 643 '(' can be omitted. We call this as "short-form" here after. | 653 '(' can be omitted. We refer to this as "short-form" hereafter. |
| 644 | 654 |
| 645 Now you may notice that there are a lot of ways for encoding the | 655 Now you may notice that there are a lot of ways for encoding the |
| 646 same multilingual text in ISO2022. Actually, there exists many | 656 same multilingual text in ISO2022. Actually, there exist many |
| 647 coding systems such as Compound Text (used in X's inter client | 657 coding systems such as Compound Text (used in X11's inter client |
| 648 communication, ISO-2022-JP (used in Japanese Internet), ISO-2022-KR | 658 communication, ISO-2022-JP (used in Japanese internet), ISO-2022-KR |
| 649 (used in Korean Internet), EUC (Extended UNIX Code, used in Asian | 659 (used in Korean internet), EUC (Extended UNIX Code, used in Asian |
| 650 localized platforms), and all of these are variants of ISO2022. | 660 localized platforms), and all of these are variants of ISO2022. |
| 651 | 661 |
| 652 In addition to the above, Emacs handles two more kinds of escape | 662 In addition to the above, Emacs handles two more kinds of escape |
| 653 sequences: ISO6429's direction specification and Emacs' private | 663 sequences: ISO6429's direction specification and Emacs' private |
| 654 sequence for specifying character composition. | 664 sequence for specifying character composition. |
| 655 | 665 |
| 656 ISO6429's direction specification takes the following format: | 666 ISO6429's direction specification takes the following form: |
| 657 o CSI ']' -- end of the current direction | 667 o CSI ']' -- end of the current direction |
| 658 o CSI '0' ']' -- end of the current direction | 668 o CSI '0' ']' -- end of the current direction |
| 659 o CSI '1' ']' -- start of left-to-right text | 669 o CSI '1' ']' -- start of left-to-right text |
| 660 o CSI '2' ']' -- start of right-to-left text | 670 o CSI '2' ']' -- start of right-to-left text |
| 661 The control character CSI (0x9B: control sequence introducer) is | 671 The control character CSI (0x9B: control sequence introducer) is |
| 662 abbreviated to the escape sequence ESC '[' in 7-bit environment. | 672 abbreviated to the escape sequence ESC '[' in a 7-bit environment. |
| 663 | 673 |
| 664 Character composition specification takes the following format: | 674 Character composition specification takes the following form: |
| 665 o ESC '0' -- start character composition | 675 o ESC '0' -- start character composition |
| 666 o ESC '1' -- end character composition | 676 o ESC '1' -- end character composition |
| 667 Since these are not standard escape sequences of any ISO, the use | 677 Since these are not standard escape sequences of any ISO standard, |
| 668 of them for these meaning is restricted to Emacs only. */ | 678 the use of them for these meaning is restricted to Emacs only. */ |
| 669 | 679 |
| 670 enum iso_code_class_type iso_code_class[256]; | 680 enum iso_code_class_type iso_code_class[256]; |
| 671 | 681 |
| 672 #define CHARSET_OK(idx, charset) \ | 682 #define CHARSET_OK(idx, charset) \ |
| 673 (coding_system_table[idx] \ | 683 (coding_system_table[idx] \ |

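The code areas and the invocation functions described in the comment above can be sketched in C. This is only an illustration under stated assumptions, not code from `coding.c`: the names `code_area`, `classify_byte`, `shift_state`, `shift_state_init`, `apply_shift`, and `next_char_register` are hypothetical, and the sketch handles only the functions that invoke into GL or act for one character (SI, SO, LS2, LS3, SS2, SS3) as listed in the table.

```c
#include <assert.h>
#include <stddef.h>

/* The four code areas of one byte, as listed in the comment. */
enum code_area { AREA_C0, AREA_GL, AREA_C1, AREA_GR };

/* Map a byte to its code area using the ranges from the comment:
   C0 0x00..0x1F, GL 0x20..0x7F, C1 0x80..0x9F, GR 0xA0..0xFF. */
enum code_area
classify_byte (unsigned char c)
{
  if (c <= 0x1F) return AREA_C0;
  if (c <= 0x7F) return AREA_GL;
  if (c <= 0x9F) return AREA_C1;
  return AREA_GR;
}

/* Hypothetical invocation state (not an Emacs structure). */
struct shift_state
{
  int gl_locked;     /* register (0..3) invoked into GL by a locking shift */
  int single_shift;  /* 2 or 3 after SS2/SS3, otherwise -1 */
};

void
shift_state_init (struct shift_state *st)
{
  assert (st != NULL);
  st->gl_locked = 0;     /* the common default: G0 is invoked to GL */
  st->single_shift = -1; /* no single shift pending */
}

/* If a shift function starts at buf[0], update *st and return the
   number of bytes it occupies; otherwise return 0. */
size_t
apply_shift (struct shift_state *st, const unsigned char *buf, size_t len)
{
  assert (st != NULL && buf != NULL);
  if (len == 0)
    return 0;
  switch (buf[0])
    {
    case 0x0F: st->gl_locked = 0; return 1;     /* SI: G0 into GL */
    case 0x0E: st->gl_locked = 1; return 1;     /* SO: G1 into GL */
    case 0x8E: st->single_shift = 2; return 1;  /* SS2, 8-bit form */
    case 0x8F: st->single_shift = 3; return 1;  /* SS3, 8-bit form */
    case 0x1B:                                  /* ESC */
      if (len < 2)
        return 0;
      switch (buf[1])
        {
        case 'n': st->gl_locked = 2; return 2;     /* LS2 */
        case 'o': st->gl_locked = 3; return 2;     /* LS3 */
        case 'N': st->single_shift = 2; return 2;  /* SS2 */
        case 'O': st->single_shift = 3; return 2;  /* SS3 */
        default:  return 0;  /* some other escape sequence */
        }
    default:
      return 0;
    }
}

/* Return the register that applies to the next graphic character.
   A pending single shift is used once and then cleared; the locking
   shift state is left untouched. */
int
next_char_register (struct shift_state *st)
{
  int reg;

  assert (st != NULL);
  reg = st->single_shift >= 0 ? st->single_shift : st->gl_locked;
  st->single_shift = -1;
  return reg;
}
```

For example, after SO (0x0E) the next character is read from G1. A following ESC 'N' selects G2 for one character only, after which `next_char_register` returns 1 again. This mirrors the statement that a single shift does not affect the locking-shift state.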