Re: ISO/IEC 10646 as Document Character Set

Martin J Duerst (mduerst@ifi.unizh.ch)
Thu, 4 May 95 09:40:37 EDT

Glenn Adams wrote:

>I have an alternative proposal that just may satisfy the largest
>community (just maybe).
>
>We could say that the default encoding scheme is base line "ISO-2022".
>Furthermore, we could say that, by default, the following initial state
>is to be assumed:
>
>1. ISO 2022 level 1 (ESC 2/0 4/12), i.e.,
> - 8-bit code
> - C0 code element
> - G0 code element having GL shift status
> - SPACE & DELETE
> - optionally a C1 code element in CR
> - G1 code element having GR shift status
>
>2. an initial designation of 8859-1 to G0/G1, i.e.,
> - ESC 2/8 4/2 (ASCII -> G0)
> - ESC 2/13 4/1 (8859-1 -> G1)
>
>In addition, we could say that, for any embedded DOCS (designations of other
>coding systems) data in which byte order is not specified, that a big-endian
>byte order is to be assumed.
>
>Given this specification of a default, both ISO 8859-1 and ISO-2022-JP
>systems could be conformant in the default case. In the latter case, however,
>additional announcers would have to be transmitted or assumed.
>
>Furthermore, since 10646 coded as UCS-2, UCS-4, UTF-8, and UTF-16, etc. all
>can be designated through DOCS, this would also allow the latter to operate
>in the default case (assuming a client could grok 2022 escapes).
>
>How does this compromise sound? Brain dead or what?

Well, to be honest, to me it does sound like that. I hope Glenn will
excuse me for being so direct. The proposal recognizes only 1/3 of the
Japanese reality, and as Japan is the main area it is concerned with,
that's not really much. Also, it means that every client has to
implement the full ISO 2022 machinery. Even areas that hardly
use the WWW at the moment would get the message that they
should use ISO 2022, instead of doing it the right
way (read: ISO 10646) from the beginning.
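
For reference, the column/row notation in the quoted proposal can be
turned into bytes mechanically. The following is only a small sketch
in Python; the helper names (code_byte, escape_sequence) are made up
for illustration and are not part of any standard or of Glenn's
proposal. It shows what the three quoted escape sequences look like
on the wire, which is the part every client would have to parse.

```python
ESC = 0x1B  # the ESCAPE control character

def code_byte(position):
    """Convert ISO 2022 'column/row' notation (e.g. '4/12') to a byte value."""
    col, row = (int(part) for part in position.split("/"))
    if not (0 <= col <= 15 and 0 <= row <= 15):
        raise ValueError("column and row must each be in 0..15")
    return col * 16 + row

def escape_sequence(*positions):
    """Build ESC followed by the bytes named in column/row notation."""
    return bytes([ESC] + [code_byte(p) for p in positions])

# The three sequences quoted in the proposal (hypothetical constant names):
ANNOUNCE_LEVEL_1 = escape_sequence("2/0", "4/12")  # ESC SP L
ASCII_TO_G0 = escape_sequence("2/8", "4/2")        # ESC ( B
LATIN1_TO_G1 = escape_sequence("2/13", "4/1")      # ESC - A
```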

What I would like to see is a proposal that soon makes it possible
to distribute browsers that know only ISO 10646,
in three to four forms, and that can rely on
proxy servers to access data that follows old and local customs.
The standard should make a strong commitment
(without being exclusive) to a restricted set of encodings
to guide future developments and to avoid the many-to-many
conversion problem in each browser. The selection of ISO 8859-1 for
one-byte encodings made it the responsibility of the data provider or the
server to get the data into a central standard form, and the
responsibility of the browser to translate this form to its local
encoding. For the European languages, this made things very
clear and simple, and now we can achieve the same worldwide.
It would have strongly hampered the WWW in (western) Europe if
all the different encodings from ISO, PC, Mac, NeXT, and so
on had coexisted, as they still largely do in non-MIME-aware
email.
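
The central-form argument above can be sketched in a few lines. This
is only an illustration, assuming Python, whose str type is Unicode
and stands in here for ISO 10646 as the hub. The function names are
hypothetical. With n encodings, direct conversion needs a converter
for every ordered pair, while a separate hub needs only one decoder
and one encoder per encoding.

```python
def convert(data, source, target):
    """Convert bytes between two encodings by way of the Unicode hub."""
    text = data.decode(source)   # source encoding -> central form
    return text.encode(target)   # central form -> target encoding

def converters_needed(n):
    """Return (direct pairwise converters, converters via a separate hub)."""
    return n * (n - 1), 2 * n

# 'a with diaeresis' as stored on a Mac and on a PC (code page 437),
# both brought to ISO 8859-1 through the same two-step path:
from_mac = convert(b"\x8a", "mac_roman", "latin-1")
from_pc = convert(b"\x84", "cp437", "latin-1")
```

For the five encoding families named above, converters_needed(5)
gives 20 direct converters against 10 through the hub, and the gap
widens as more encodings are added.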

What we should specify then, in my opinion, is the following
(without exact standard wording):

a) charset has to be specified by the server whenever possible.
b) If that is missing, ISO 8859-1 is assumed.
c) For worldwide handling, servers are (kindly) requested to
make their data available in one of a very few encodings
(centered around ISO 10646).
d) For backward compatibility and local convenience, other
codings may be used locally, but have to be specified.
[e) as a small warning to implementers: There is/was also
the practice of using local encodings without specifying
them. This practice is strongly discouraged, but it may
continue to exist for some time in some local areas.]
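
On the client side, rules a) to d) could look roughly like the sketch
below. It is not standard wording, just one possible reading in
Python. The set PREFERRED is an assumption: the list above does not
name the "very few encodings", so the ISO 10646 forms mentioned in
the quoted message are used as placeholders. The function name and
the three status labels are likewise invented for illustration.

```python
from email.message import Message

DEFAULT_CHARSET = "iso-8859-1"  # rule b)

# Assumed placeholder for the "very few encodings" of rule c).
PREFERRED = frozenset({"utf-8", "utf-16", "iso-10646-ucs-2", "iso-10646-ucs-4"})

def resolve_charset(content_type):
    """Return (charset, status) for a Content-Type header value or None."""
    charset = None
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lowercased, or None if absent
    if charset is None:
        return DEFAULT_CHARSET, "assumed"    # rule b): nothing specified
    if charset in PREFERRED:
        return charset, "worldwide"          # rule c): ISO 10646 based
    return charset, "local"                  # rule d): any other declared coding
```

A call such as resolve_charset("text/html; charset=ISO-2022-JP")
yields ("iso-2022-jp", "local"). The undeclared local encodings of
point e) cannot be detected this way and would simply fall under
rule b).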

Comments: The word 'local' makes it clear that this practice
is restricted (e.g. to Japan), and not suggested for worldwide
communication. (Note that the L10N paper correctly
speaks of multi-LOCALization when describing
the actual system they developed, whereas they see a
multiLINGUAL system as the final aim. And multi-localization
is far from world-wide as in World-Wide Web.)
On the other hand, points d) and e) recognize current
non-standard practice.

I think that in this way, we can
a) Take the lead and show the direction to go.
b) Keep the door open for the moment for existing stuff.

Regards, Martin.
----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
テュールスト・マーティン・ヤコブ(チューリッヒ大学情報科学科)
----