Re: Charset labelling (Was: Comments on: "Character Set" Considered

Glenn Adams (glenn@stonehand.com)
Fri, 28 Apr 95 14:39:26 EDT

Date: Fri, 28 Apr 1995 12:36:58 -0500
From: "Dave Saunders" <dave@intercon.com>

The one thing that seemed to be a limitation of the ISO 8879:1986 Annex E.3
stuff was that once you switched out of Latin1, you couldn't get back in.
However that could be just because the existing browsers hacked to do 2022
have that limitation.

It would have to be the latter (an implementation limitation). 8879 E.3
admits all of the ISO 2022 techniques (announcement, designation,
invocation, etc.). Switching from 8859-1 to Unicode (as UCS-2) and back
would look something like the following (assuming an initial state of
ISO 2022 level 1, an initial designation of 8859-1 to G0/G1, and
big-endian byte order):

41 A
C1 A ACUTE
1B 25 2F 40 ESC 2/5 2/15 4/0 (designate UCS-2 Coding System)
00 41 A
00 C1 A ACUTE
0E 01 THAI KO KAI
06 21 ARABIC HAMZA
00 1B 00 25 00 40 ESC 2/5 4/0 (return to ISO 2022 Coding System)
41 A
C1 A ACUTE
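
The byte listing above can be reproduced and read back with a short
sketch. This is only an illustration, not a real ISO 2022 implementation:
it recognizes just the two escape sequences used in the example, the
function names are made up for this note, and Python's "utf-16-be" codec
stands in for big-endian UCS-2 (the two agree for these characters).

```python
# Escape sequences taken from the byte listing above.
ESC_TO_UCS2 = bytes([0x1B, 0x25, 0x2F, 0x40])  # ESC 2/5 2/15 4/0
ESC_RETURN = bytes([0x1B, 0x25, 0x40])         # ESC 2/5 4/0


def pad_to_ucs2(seq: bytes) -> bytes:
    """Pad each byte of an escape sequence to a 16-bit big-endian unit."""
    return b"".join(bytes([0x00, b]) for b in seq)


PADDED_RETURN = pad_to_ucs2(ESC_RETURN)        # 00 1B 00 25 00 40


def build_stream() -> bytes:
    """Assemble the example: 8859-1, switch to UCS-2, switch back."""
    out = bytearray()
    out += "A\u00c1".encode("latin-1")
    out += ESC_TO_UCS2
    out += "A\u00c1\u0e01\u0621".encode("utf-16-be")
    out += PADDED_RETURN
    out += "A\u00c1".encode("latin-1")
    return bytes(out)


def decode_stream(data: bytes) -> str:
    """Decode a stream that uses only the two escapes above (hypothetical)."""
    text = []
    i = 0
    in_ucs2 = False
    while i < len(data):
        if not in_ucs2:
            if data.startswith(ESC_TO_UCS2, i):
                in_ucs2 = True
                i += len(ESC_TO_UCS2)
            else:
                text.append(data[i:i + 1].decode("latin-1"))
                i += 1
        else:
            if data.startswith(PADDED_RETURN, i):
                in_ucs2 = False
                i += len(PADDED_RETURN)
            else:
                text.append(data[i:i + 2].decode("utf-16-be"))
                i += 2
    return "".join(text)


def hexdump(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    stream = build_stream()
    print(hexdump(stream))
    print(decode_stream(stream))
```

The last two characters decode as 8859-1 again, which is the point of the
example: nothing in the encoding itself prevents getting back into Latin1.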

However, a much simpler solution for the same data would be either

(1) UCS-2:

00 41 A
00 C1 A ACUTE
00 41 A
00 C1 A ACUTE
0E 01 THAI KO KAI
06 21 ARABIC HAMZA
00 41 A
00 C1 A ACUTE

or (2) UTF-8:

41 A
C3 81 A ACUTE
41 A
C3 81 A ACUTE
E0 B8 81 THAI KO KAI
D8 A1 ARABIC HAMZA
41 A
C3 81 A ACUTE
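
Both listings can be checked mechanically. The sketch below encodes the
same eight characters and prints one line per character. It assumes
Python's "utf-16-be" codec as a stand-in for big-endian UCS-2 (Python has
no codec named UCS-2, and the two agree for these characters); the names
TEXT, hex_bytes and dump are invented for the illustration.

```python
import unicodedata

# A, A ACUTE, A, A ACUTE, THAI KO KAI, ARABIC HAMZA, A, A ACUTE
TEXT = "A\u00c1A\u00c1\u0e01\u0621A\u00c1"


def hex_bytes(data: bytes) -> str:
    return " ".join(f"{b:02X}" for b in data)


def dump(text: str, encoding: str) -> list:
    """Return (hex bytes, character name) pairs, one per character."""
    return [(hex_bytes(ch.encode(encoding)), unicodedata.name(ch))
            for ch in text]


if __name__ == "__main__":
    for label, encoding in (("UCS-2", "utf-16-be"), ("UTF-8", "utf-8")):
        print(label + ":")
        for hex_part, name in dump(TEXT, encoding):
            print(" ", hex_part, name)
```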

The advantage of UCS-2 is its fixed width; the advantage of UTF-8 is its
interchange compatibility with C string type implementations, etc.
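
A rough way to see both properties, again with "utf-16-be" standing in
for big-endian UCS-2. The helper c_strlen is a hypothetical stand-in that
mimics what a C strlen() would report (the number of bytes before the
first zero byte); it is not a call into any C library.

```python
SAMPLE = "A\u00c1A\u00c1\u0e01\u0621A\u00c1"


def c_strlen(buf: bytes) -> int:
    """Bytes before the first zero byte, as C strlen() would count them."""
    idx = buf.find(b"\x00")
    return len(buf) if idx < 0 else idx


def ucs2_char_at(buf: bytes, n: int) -> str:
    """Fixed width: character n always sits at byte offset 2 * n."""
    return buf[2 * n:2 * n + 2].decode("utf-16-be")


if __name__ == "__main__":
    ucs2 = SAMPLE.encode("utf-16-be")
    utf8 = SAMPLE.encode("utf-8")
    # Direct indexing works in UCS-2 without scanning the string.
    print(ucs2_char_at(ucs2, 4) == "\u0e01")
    # UCS-2 data starts with a zero byte here, so a C string routine
    # would see an empty string; the UTF-8 form has no zero bytes.
    print(c_strlen(ucs2), len(ucs2))
    print(c_strlen(utf8), len(utf8))
```

In UTF-8, finding character n means scanning from the start, since
characters take one to three bytes in this sample.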

Regards,
Glenn