Re: Objections to draft-ietf-html-spec-01.txt

Albert Lunde (Albert-Lunde@nwu.edu)
Tue, 21 Mar 95 18:21:07 EST

At 12:21 PM 3/21/95, Larry Masinter wrote:
>As the author of those words, I'll admit that they're unnecessary and
>more prone to introduce confusion than insight. I think all that's
>necessary is to strike the "rather than relying on any SGML mechanism
>for doing so."
>
>"It is envisioned that HTML will use the charset parameter to allow
>support for non-Latin characters such as Greek, Arabic, Hebrew and
>Japanese."
>
>However, Terry's further explanation that "HTML cannot conform to the
>SGML standard, ISO 8879, if the charset encodings are specified by
>some other means" is another point that I _thought_ we'd gone over,
>but I'll have to review the archives to find the points again.

At 1:12 PM 3/21/95, James D Mason wrote:
>I also second Terry's position. If we're trying to make HTML a better
>application of SGML and so ease the lives of those of us who use HTML as an
>output form into which to render documents done in other SGML applications, we
>should use only the mechanism specified in the standard.

I think what we are running into is a conflict between our attempts to
conform with MIME and our attempts to conform with SGML.

I think Larry introduced the language in question in an attempt to make the
HTML 2.0 spec work for other single character sets than ISO Latin-1, using
a MIME-like character set parameter. (This was in part a way to
postpone/avoid getting into general multilingual issues by indicating a
mechanism for the simple MIME-like cases.)
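
As a rough sketch of the simple MIME-like case, a client could read the
charset parameter from the Content-Type header and fall back to ISO
Latin-1 when it is absent. The function name and the fallback are
assumptions made for illustration, not wording from the draft.

```python
from email.message import Message

# Assumption for this sketch: documents with no charset parameter are
# treated as ISO Latin-1.
DEFAULT_CHARSET = "iso-8859-1"


def charset_of(content_type: str) -> str:
    """Return the lowercased charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset parameter in lowercase,
    # or failobj when the parameter is missing.
    return msg.get_content_charset(failobj=DEFAULT_CHARSET)


print(charset_of("text/html; charset=ISO-8859-7"))
print(charset_of("text/html"))
```

Nothing here touches the document body, which is the point of the
MIME-like approach: the label travels outside the HTML.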

I still think this is the right direction to go. I don't think putting the
character set declaration in the body of a document makes sense in the
context of current versions of HTML and HTTP. (The 2.0 spec needs to
stay close to current practice, which doesn't put much of the SGML "stuff"
in the document.)

On the other hand, it suggests that, to satisfy SGML mavens, we at least
need to specify a mechanism/algorithm to derive an SGML declaration for
character sets other than ISO Latin-1.

ERCS may be a way to do this. (Define character classes for Unicode and
project downward to subsets.)

There are simpler mechanisms that may work for US-ASCII-like character
sets: define character classes on ASCII or Latin-1 and lump everything
else together somehow. (I guess this is how implementations of
multilingual WWW are actually working now.)
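
A minimal sketch of that "lump everything else together" idea, assuming
an 8-bit, US-ASCII-compatible character set. The class names are
hypothetical labels chosen for this example; they are not taken from
ISO 8879 or from any existing implementation.

```python
def char_class(code: int) -> str:
    """Classify one 8-bit character code for a US-ASCII-like set."""
    if code < 128:
        ch = chr(code)
        if ch.isalpha():
            return "name-start"   # ASCII letters
        if ch.isdigit() or ch in ".-":
            return "name"         # digits, period, hyphen
        if ch in " \t\r\n":
            return "separator"    # space, tab, line ends
        return "data"             # all other ASCII
    # Everything outside ASCII is lumped into one class, whatever the
    # charset says those codes mean.
    return "other-data"


print(char_class(ord("a")), char_class(ord("-")), char_class(0xE1))
```

The cost of this shortcut is visible in the last branch: a Greek or
Hebrew letter can never be a name character, only data.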

One problem I see with choosing this simpler method to derive the SGML
stuff is that it might foreclose or complicate options to use ERCS later to
provide better SGML stuff for Unicode.

In any case, this question raises some of the character set/multilingual
issues again: we don't have to solve them all, but it seems good to look at
the implications.

We could also abandon the attempt to define what the charset parameter
really means in the HTML 2.0 spec and just indicate that clients should not
choke on it (though this is really an HTTP issue). But this would make it
rather urgent to deal with in 2.1, at least for the simple MIME-like case.

---
    Albert Lunde                      Albert-Lunde@nwu.edu