Re: ISO/IEC 10646 as Document Character Set

Martin J Duerst (mduerst@ifi.unizh.ch)
Thu, 4 May 95 16:07:46 EDT

>This is not a democracy :-) We go by technical arguments, not by
>shouting. I don't see why we need to put ISO10646 as the document
>character set in HTML 2.0. Everybody can do everything they need to
>do -- and reliably -- even if the 2.0 RFC only specifies ISO-8859-1

I could live without a vote, and without ISO10646 being explicitly
the document character set of HTML 2.0, but in that case only
if the wording in some of the paragraphs below is strengthened.
We should not leave the door wide open for future incompatibility
problems, as in:

>| ... A minimally conforming HTML user agent must support the SGML
>| declaration in section SGML Declaration for HTML, which specifies ISO
>| Latin 1 (@@full name) as the document character set; it may support
>| other SGML declarations, in particular, SGML declarations with other
>| document character sets.

We don't want whatever falls under "other document character sets".
I do not know whether the concept of a subset of a document character
set already exists, but ideally we want everybody to use a document
character set that contains Latin-1 and is contained in
ISO 10646, so that any reference of the form &#xxxx; is either unique
or void in the given document character set.
I am sorry that I don't feel safe enough in SGML terms to suggest
an exact wording, but I am ready to help anybody who does.
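
The "between Latin-1 and ISO 10646" condition above can be sketched in
code. This is only an illustration under stated assumptions: a document
character set is modelled as a mapping from code numbers to characters,
Python's chr() stands in for the ISO 10646 numbering, Latin-1 is taken
as all 256 positions for simplicity, and the helper names are
hypothetical, not SGML terminology.

```python
# Model a document character set as {code number: character}.
# Assumption: Python's chr(n) stands in for the ISO 10646 numbering.

def latin1_charset():
    """Latin-1, simplified to all 256 code positions."""
    return {n: chr(n) for n in range(256)}

def charset_from_codec(codec_name):
    """Build a single-byte character set from a Python codec."""
    charset = {}
    for n in range(256):
        try:
            charset[n] = bytes([n]).decode(codec_name)
        except UnicodeDecodeError:
            pass  # position not assigned in this codec
    return charset

def is_between_latin1_and_10646(charset):
    """True if the set contains Latin-1 and agrees with ISO 10646 numbering."""
    contains_latin1 = all(charset.get(n) == chr(n) for n in range(256))
    inside_10646 = all(ch == chr(n) for n, ch in charset.items())
    return contains_latin1 and inside_10646

latin1 = latin1_charset()

# Latin-1 extended with the Greek block, numbered as in ISO 10646.
latin1_plus_greek = dict(latin1)
latin1_plus_greek.update({n: chr(n) for n in range(0x0370, 0x0400)})

# KOI8-R reuses positions 128-255 for other characters.
koi8r = charset_from_codec("koi8_r")

print(is_between_latin1_and_10646(latin1))             # True
print(is_between_latin1_and_10646(latin1_plus_greek))  # True
print(is_between_latin1_and_10646(koi8r))              # False
```

With a set that passes this check, a reference such as &#233; names the
same character no matter which conforming set a document declares.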

[By the way: what happens if such a reference is unknown (e.g.
a reference to something beyond Latin-1 when the document character
set is only Latin-1)? Ideally, from a Unicode point of view, it would
at least be ignored for display (but not eliminated when forwarding
the document again), without creating errors, but SGML probably
has other ideas for this case.]
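
A minimal sketch of the behaviour described as ideal in the bracketed
note: references beyond the supported range are dropped for display
only, while the source text is forwarded untouched. This is not what
SGML prescribes; the function names and the charset_max parameter are
hypothetical.

```python
import re

# Decimal numeric character references such as &#233;
NUMERIC_REF = re.compile(r"&#(\d+);")

def render_for_display(text, charset_max=255):
    """Resolve references up to charset_max; silently skip the rest."""
    def replace(match):
        number = int(match.group(1))
        return chr(number) if number <= charset_max else ""
    return NUMERIC_REF.sub(replace, text)

def forward(text):
    """Pass the document on unchanged, unknown references included."""
    return text

source = "na&#239;ve &#945;"
print(render_for_display(source))
print(forward(source) == source)
```

Here &#239; is within Latin-1 and is displayed, while &#945; (beyond
Latin-1) is omitted from the display but survives in the forwarded text.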

>What do we gain by putting ISO10646 in there? I think we lose: folks
>may expect browsers to support all of ISO10646 if it's in the spec.
>That would not improve consumer confidence. Putting ISO10646 in the
>spec without discussing conformance is a losing proposition. ISO10646
>carries a lot of issues, in fact: fonts, encodings, and all sorts of
>stuff that's new to many parts of the community. With ISO-8859-1, the
>X window system charted the waters to some extent.

I can agree with the argument of consumer confidence. But we must
in any case prevent anybody from choosing a document character
set that is not "between" (in the set inclusion sense) Latin-1 and ISO 10646.

>OK. Quick: install those sgmls patches all over the world so I don't
>have to answer the mail about "why doesn't sgmls work any more? My
>documents used to validate, and with the new DTD, they're broken."
>Deploying technology takes time.

How can documents be broken if they used Latin-1 so far and
now have ISO 10646 as the document character set? All references of the
form &#xxx; up to xxx=255 will still work.
By the way, it is interesting to note that in strict SGML terms,
the I10L stuff uses (misuses) Latin-1 as a document character
set, just with another display engine. This is especially true
in the case of EUC and SJIS; there might be problems with
the ESC character in the case of JIS, because this is probably
not included in the Latin-1 document character set.
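
The point about the ESC character can be checked with a small sketch.
It assumes Python's euc_jp, shift_jis and iso2022_jp codecs as stand-ins
for EUC, SJIS and JIS; the helper names are hypothetical, and the sketch
does not check which byte values the SGML declaration actually admits.

```python
ESC = 0x1B  # the escape control character

def contains_escape(text, codec_name):
    """True if encoding text with the codec produces an ESC byte."""
    return ESC in text.encode(codec_name)

def as_latin1(text, codec_name):
    """Reinterpret the encoded bytes one by one as Latin-1 characters."""
    return text.encode(codec_name).decode("latin-1")

sample = "\u65e5\u672c\u8a9e"  # three kanji, used only as test input

for codec in ("euc_jp", "shift_jis", "iso2022_jp"):
    print(codec, contains_escape(sample, codec))
```

Only the iso2022_jp output contains ESC bytes, because that encoding
switches character sets with escape sequences. In all three cases the
bytes can be read back as Latin-1 code positions and re-encoded without
loss, which is the reinterpretation the paragraph above describes.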

Regards, Martin.

----
Dr.sc. Martin J. Du"rst ' , . p y f g c R l / =
Institut fu"r Informatik a o e U i D h T n S -
der Universita"t Zu"rich ; q j k x b m w v z
Winterthurerstrasse 190 (the Dvorak keyboard)
CH-8057 Zu"rich-Irchel Tel: +41 1 257 43 16
S w i t z e r l a n d Fax: +41 1 363 00 35 Email: mduerst@ifi.unizh.ch
----