Re: ISO/IEC 10646 as Document Character Set

lilley (lilley@afs.mcc.ac.uk)
Fri, 5 May 95 06:42:14 EDT

Dan said:

> http://www.w3.org/hypertext/WWW/MarkUp/html-spec/html-spec_2.html#SEC8
> |HTML Lexical Syntax
> |
> | ... A minimally conforming HTML user agent must support the SGML
> | declaration in section SGML Declaration for HTML, which specifies ISO
> | Latin 1 (@@full name) as the document character set; it may support
> | other SGML declarations, in particular, SGML declarations with other
> | document character sets.

Right, so a UA which supported, say, Latin-1 and HP-Roman8 and SJIS as
document character sets - perhaps even with Roman8 as the default -
would be a conforming UA by this spec but be somewhat screwed when 2.1,
3.0 or n.n as n tends to infinity specifies 10646 as *the* document
character set? How would the fact that a given document uses a
different SGML declaration be communicated to the client?

I feel that this extract sets a dangerous precedent and opens the door
to all sorts of bizarre document character sets. A nightmare for the future.

I would be much happier if the SGML declaration specified 10646 as *the*
document character set and noted that historical UAs have only used the
first 256 characters of this set, which corresponds to Latin-1. For
compatibility with historical implementations, documents labelled as
text/html; version=2.0 should not contain characters greater than 255.
Future versions of the specification may remove this restriction.
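
To make the restriction concrete, here is a rough sketch in Python of
the two claims in the paragraph above: that the 256 Latin-1 values line
up one-for-one with the first 256 positions of 10646, and that a 2.0
document can be checked for characters above 255. The function names are
made up for illustration, and the sketch assumes that Python's string
code points can stand in for 10646 code positions.

```python
# Illustrative sketch; none of these names come from the HTML spec.
# Assumption: Python str code points stand in for 10646 code positions.

def latin1_matches_10646_prefix() -> bool:
    """Check that every Latin-1 byte decodes to the code point of the same value."""
    return all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))


def fits_html20_restriction(text: str) -> bool:
    """True if no character in the document has a code position above 255."""
    return all(ord(ch) <= 255 for ch in text)


def first_offender(text: str):
    """Return (index, character) of the first character above 255, or None."""
    for i, ch in enumerate(text):
        if ord(ch) > 255:
            return i, ch
    return None
```

A checker along these lines would pass any existing Latin-1 document
unchanged and would only complain about documents that a historical UA
could not have handled anyway.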

I would then suggest that the spec should say that a minimally
conforming 2.0 UA may choose to support an SGML declaration with Latin-1
as the document character set, since under the restriction above this
is functionally equivalent.

This produces a spec which not only formalises mid-94 practice (Latin-1
only) but also gives clear hints regarding which way the standardisation
effort is moving.

What do people think of this? [And I am sure you all know what I am
getting at, and will excuse me if I have any lapses from PC terminology
;-) ]

> What do we gain by putting ISO10646 in there? I think we lose: folks
> may expect browsers to support all of ISO10646 if it's in the spec.

Your point is well made, but I think my suggestion covers that one.
Hoist by your own petard; folks can expect all they want, but if the
spec precisely states that only the first 256 characters are supported
then they are expecting wrong.

> Glenn Adams writes
> > [I already gave you the minimum
> > changes needed in the SGML declaration.]

Glenn, could you hack up the changes that would be needed to support
my proposal?

> OK. Quick: install those sgmls patches all over the world so I don't
> have to answer the mail about "why doesn't sgmls work any more? My
> documents used to validate, and with the new DTD, they're broken."
> Deploying technology takes time.

Certainly.

My proposal would take care of that one, although putting the patched
versions of sgmls binaries in the HaL html-check package would be one
good way of disseminating them now, so they would already be in place
when 2.1, 3.0 or whatever rolls out with support for 10646 characters
greater than 255. With my proposal, though, those changes would not
need to be exercised yet.

> HTML 2.0 is a well known quantity in a fairly large community.

I beg to differ. There is no HTML 2.0 until it has been published.
Working code, and loads of documents, are the known quantity in a large
community. What the spec-to-be does is formalise existing practice.
The practice existed before the spec.

Now that we have satisfactorily distinguished between the character
encoding scheme used to transport the documents and the document
character set which is used in the HTML SGML declaration, there is no
impact on existing users if we say that the document character set is
Latin-1 or say it is the first 256 characters in 10646. Choosing the
latter option provides the cleanest upgrade path for full
multinationalisation support in the future, as I think we all agree.
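
The distinction between the transport encoding and the document
character set can be sketched in Python as two separate steps. This is
only an illustration under assumptions: the function names are
hypothetical, the two encodings are picked arbitrarily, and Python's
code points are taken as 10646 code positions. The point is that a
numeric character reference such as &#233; resolves to the same
character whichever encoding carried the bytes.

```python
import re

# Illustrative sketch; the function names are hypothetical.

def decode_transport(raw: bytes, encoding: str) -> str:
    """Step 1: undo the character encoding scheme used in transport."""
    return raw.decode(encoding)


def resolve_numeric_refs(text: str) -> str:
    """Step 2: resolve &#n; against the document character set.

    Assumption: n is a 10646 code position, modelled here by chr().
    """
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


source = "caf\u00e9 &#233;"

# The same document sent under two different transport encodings.
via_latin1 = resolve_numeric_refs(
    decode_transport(source.encode("latin-1"), "latin-1"))
via_utf8 = resolve_numeric_refs(
    decode_transport(source.encode("utf-8"), "utf-8"))

# Both routes yield the same characters, because the reference is
# interpreted in the document character set, not in the encoding.
print(via_latin1 == via_utf8)
```

Since 233 means the same character in Latin-1 and in 10646, existing
documents resolve identically under either choice of document character
set; only positions above 255 would behave differently.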

--
Chris Lilley, Technical Author
+-------------------------------------------------------------------+
|       Manchester and North HPC Training & Education Centre        |
+-------------------------------------------------------------------+
| Computer Graphics Unit,             Email: Chris.Lilley@mcc.ac.uk |
| Manchester Computing Centre,        Voice: +44 61 275 6045        |
| Oxford Road, Manchester, UK.          Fax: +44 61 275 6040        |
| M13 9PL                            BioMOO: ChrisL                 |
|     URI: http://info.mcc.ac.uk/CGU/staff/lilley/lilley.html       |
+-------------------------------------------------------------------+
|     "The first W in WWW will not wait."   François Yergeau        |
+-------------------------------------------------------------------+