Re: HTML 2.0 comments (First of two)

Daniel W. Connolly (connolly@hal.com)
Wed, 23 Nov 94 16:09:12 EST

In message <199411231854.OAA14495@postman.osf.org>, "Sandra Martin O'Donnell"
writes:
>I recently had a chance to read the HTML 2.0 specification, and
>have some serious concerns about its design with respect to
>internationalization (I18N) issues.

Believe me: we are all concerned about I18N.

>COMMENTS ON HTML SPECIFICATION -- 2.0
>(First of two)
>
>After reading the HTML spec, I have one overall concern that
>affects many sections. Currently, the code set ISO 8859-1
>(Latin-1) is listed as the one that HTML supports. The spec
>permits documents to include any Latin-1 character, and lists
>the entity names and encoded values for each character in the
>Latin-1 repertoire.
>
>I assume the working group did this to address international
>needs as you saw them.

Not at all. We wrote that down because that's the way it works. The
charter/purpose of the HTML 2.0 spec is:

*** to describe current practice ***

All this moaning and wailing and gnashing of teeth for the last 6
months is _just_ to describe precisely what goes on today, or some
tractable subset of what goes on today.

Today, browsers do ISO8859-1.

We have participated in -- even initiated -- numerous discussions on
www-talk@info.cern.ch, www-html@info.cern.ch, comp.text.sgml, etc.
regarding the issues of other character sets and encodings,
non-western writing systems, etc.; e.g.:

WWW Talk Jul 94-present: Putting the "World" back in WWW...
http://gummo.stanford.edu/html/hypermail/www-talk-1994q3/1212.html

There is no widely agreed-upon solution today. I suspect that some
Unicode/UTF-8 based solution will be deployed soon.

This working group has shied away from design work. I prefer the more
Darwinian approach: let the researchers explore different strategies,
and see which one becomes popular. Then clean it up and standardize.

Design by committee is one of the great evils of the world today!

That said, I'd like to address two issues you raise:

(1) that the current document does not allow for extensibility in
this area, and

(2) specific proposals for I18N.

>But the way the spec is written makes it difficult or impossible
>to support anything other than Latin-1. That's because you've
>allowed numeric character values to be used for the Latin-1
>characters. The problem is that many code sets use the same
>numeric values for their own characters, but since HTML says
>the values are Latin-1 and only Latin-1, these other code sets
>can't be supported.

Numeric character references are an SGML mechanism, not something the
HTML community made up. SGML numeric character references are _always_
interpreted in the context of _some_ SGML declaration which specifies
the document character set. In HTML 2.0, all documents share the same,
implicit SGML declaration, which specifies ISO Latin 1.

Hence &#225; only indicates a-acute as long as the SGML declaration says
that the document character set is ISO8859-1.
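
To illustrate the point, here is a minimal Python sketch (not part of the
spec; the function name resolve_numeric_reference is made up for this
example, and it only handles single-byte character sets). It resolves the
same number, 225, under three different document character sets and prints
the Unicode name of whatever character comes out:

```python
import unicodedata


def resolve_numeric_reference(code: int, document_charset: str) -> str:
    """Interpret the number from a reference such as &#225; in the context
    of a given document character set (8-bit charsets only in this sketch)."""
    return bytes([code]).decode(document_charset)


# The same number names a different character under each declaration.
for charset in ("iso8859_1", "iso8859_5", "iso8859_7"):
    ch = resolve_numeric_reference(225, charset)
    print(charset, unicodedata.name(ch))
```

The number alone carries no meaning; the declared document character set
is what turns it into a character.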

I believe the consensus of the working group is that we should reserve
the "charset" parameter of the text/html MIME media type for future
use. We intend to specify ways to use other character sets and encodings
in HTML documents, once we have a suitable base of experience built up.

This future use might be, for example:

Content-Type: text/html; charset="UTF-8"

which would cause the user agent to assume a different SGML declaration
from the HTML 2.0 SGML declaration.
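
As a rough sketch of how a user agent might pick the character set from
such a header, assuming it falls back to ISO-8859-1 when no charset
parameter is present (the function name document_charset and the fallback
policy are assumptions for illustration, not something the spec defines):

```python
from email.message import Message


def document_charset(content_type: str, default: str = "ISO-8859-1") -> str:
    """Return the charset parameter of a Content-Type value, or the
    assumed default when the parameter is absent."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_param strips the surrounding quotes from the parameter value.
    return msg.get_param("charset", default)


print(document_charset('text/html; charset="UTF-8"'))
print(document_charset("text/html"))
```

The returned name would then select which SGML declaration, and hence
which document character set, the parser uses.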

It is evident from your confusion that the current document doesn't
make this business clear.

I nominate Murray Maloney to take the action to get this cleared up.
He has submitted blurbs that explain this coherently more than once. Somehow,
they keep getting lost in the noise.

Eric: perhaps you could lend Murray a copy of the Frame source for a
few days and let him actually hack it in there?

>What to do about this? There are three options:
>
>1. Do nothing. This means HTML will only support Latin-1.
>That may be good enough for your community of users now,
>but it is not if you want more of the world's users to be
>able to mark up documents. If the spec remains as it is, and
>you later want to add support for more of the world, HTML
>will almost certainly have to change in some probably
>incompatible way.

For the 2.0 RFC, this is what I expect we will do. Well... we should
make it clear that there will be (upward compatible) changes in the
area of character sets and encodings, but we will not actually specify
any mechanisms.

>2. Use the universal code set ISO 10646 (basically the same
>as Unicode) for numeric character values.

A distinct possibility for HTML 2.1 and beyond...

>If I'm writing in Japanese, however, and can only refer to
>characters by their numeric values, the source is
>incomprehensible. It would look something like:
>
> /* random values for example only */
> <P>6e206e437934141
> <P>973387b4ff419932fff8</P>

I would expect to use the UTF-8 encoding of Unicode characters. Yes,
Japanese text would be incomprehensible to ASCII-based text viewers. I
don't see this as a problem.
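
A small Python sketch of what that would look like on the wire (the sample
text is arbitrary, and viewing the bytes as Latin-1 is only a stand-in for
what an 8-bit text viewer might display):

```python
# A paragraph of Japanese text, marked up as usual.
source = "<P>日本語</P>"

# Encode the document as UTF-8: the tags stay plain ASCII bytes,
# while each Japanese character becomes a multi-byte sequence.
encoded = source.encode("utf-8")
print(encoded)

# What a viewer that assumes Latin-1 would show: readable tags,
# unreadable text in between.
print(encoded.decode("iso8859_1"))
```

The markup itself survives intact, so only the character data is opaque
to a viewer that does not understand the encoding.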

>3. Remove the ability to refer to characters by their
>numeric values and instead add a tag that designates the
>code set for the document.

No can do. HTML is an SGML application. This is part of SGML.

Thanks for your valuable comments. I hope you will stay in the loop
as we discuss character sets, languages, and non-western writing systems
in future versions of the spec.

Dan