Re: Revised language on: ISO/IEC 10646 as Document Character Set

Martin J Duerst (mduerst@ifi.unizh.ch)
Tue, 9 May 95 07:50:32 EDT

I am hurrying to get things straight here from Europe
before anybody in the US has to worry about this.

>If you convert iso-2022-jp to 10646 and then back again to iso-2022-jp,
>you could end up with a file that is different from the original
>iso-2022-jp document. For example, some of the "double-width" Roman
>characters in the JIS X 0208 portion of iso-2022-jp are not in 10646.

This is not true. The "double-width" Latin characters are in the
compatibility section (U+FF00 onwards). Special care has been taken
when designing Unicode/ISO 10646 to avoid such problems.
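
For anyone who wants to check this, here is a minimal sketch. It
assumes Python and its built-in "iso2022_jp" codec, neither of which
is part of this discussion; the names FULLWIDTH_A and round_trip are
made up for illustration.

```python
# Sketch: a "double-width" Latin letter survives the trip
# iso-2022-jp -> Unicode/10646 -> iso-2022-jp.
# Assumption: Python's built-in "iso2022_jp" codec does the conversion.

FULLWIDTH_A = "\uff21"  # FULLWIDTH LATIN CAPITAL LETTER A, in the U+FF00 block


def round_trip(data: bytes) -> bytes:
    """Decode iso-2022-jp bytes to Unicode, then encode them back."""
    return data.decode("iso2022_jp").encode("iso2022_jp")


# Start from the JIS X 0208 form of the character.
encoded = FULLWIDTH_A.encode("iso2022_jp")
decoded = encoded.decode("iso2022_jp")

print(hex(ord(decoded)))               # code point of the decoded character
print(round_trip(encoded) == encoded)  # True if nothing was lost
```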

>Also, you could lose info encoded in the escape sequences themselves:
>ESC ( B, ESC ( J, ESC $ @ and ESC $ B. So iso-2022-jp is not a subset
>of 10646, if you look at it this way.

First, the term "subset" applies only to the characters themselves,
not to any encoding quirks. In that sense, it doesn't apply to the term
"charset" as used in standards (see Dan's article on "Character set considered
harmful").

Second, round trip is officially guaranteed only for the latest
version of JIS X 0208, namely JIS X 0208-1990. The official (but to my
knowledge virtually never used) escape sequence for this is
ESC & @ ESC $ B. In practice, ISO 10646 guarantees round trip
with any of the following combinations of one from (a-c) and one
from (A-B):
a) JIS C 6226-1978 (ESC $ @)
b) JIS X 0208-1983 (ESC $ B)
c) JIS X 0208-1990 (ESC & @ ESC $ B)

A) JIS-Roman (JIS 0201, ESC ( J)
B) ASCII (ESC ( B)

The information on which combination was used is lost, but the
file can be restored in its entirety if you know which combination
was used. The same happens if you go from JIS to SJIS or EUC
and back; you will not automatically know which combination was used.
For practical purposes, this is not relevant; files are converted
frequently to SJIS/EUC and back. Also, I do not know of any
system that would use more than one of a-c, or more than one of
A-B, in a single file with its full consequences.
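
To illustrate the point about the lost combination, here is a rough
sketch, again assuming Python's built-in "iso2022_jp" codec. The helper
names to_unicode and from_unicode are hypothetical, and only the choice
between (a) and (b) is handled; (c) and the A/B choice are left out.

```python
# Sketch: the designating escape sequence is lost in Unicode, but the
# original bytes can be rebuilt if the combination is known out of band.
# Assumption: Python's "iso2022_jp" codec accepts both ESC $ @ and
# ESC $ B on input and writes ESC $ B on output.

ESC_1978 = b"\x1b$@"   # a) JIS C 6226-1978
ESC_1983 = b"\x1b$B"   # b) JIS X 0208-1983
ESC_ASCII = b"\x1b(B"  # B) ASCII


def to_unicode(data: bytes) -> str:
    """Decode iso-2022-jp; which kanji set was designated is not kept."""
    return data.decode("iso2022_jp")


def from_unicode(text: str, kanji_esc: bytes = ESC_1983) -> bytes:
    """Encode to iso-2022-jp, designating the kanji set the caller names.

    ESC (0x1B) never occurs inside the two-byte character data, so
    replacing the escape sequence cannot corrupt the text itself.
    """
    return text.encode("iso2022_jp").replace(ESC_1983, kanji_esc)


# A file that used the 1978 designation for one double-width letter.
original = ESC_1978 + b"#A" + ESC_ASCII

text = to_unicode(original)
naive = from_unicode(text)                # combination unknown
restored = from_unicode(text, ESC_1978)   # combination known

print(naive == original)     # False: the escape sequence changed
print(restored == original)  # True: restored in its entirety
```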

>Why not just remove the restriction that the HTTP charset has to be a
>subset of 10646? I.e. remove the word "subset" somehow.

a) see above: it is not necessary to remove "subset".
b) if it can be anything, communication will be very difficult.

Regards, Martin.
----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
----