Re: HTML/SGML/charsets

Joe English (joe@trystero.art.com)
Fri, 31 Mar 95 13:29:45 EST

eric@spyglass.com (Eric W. Sink) wrote:

> Terry, your recollection of the WG's decision on the charset issue does not
> match mine. You and I were both at the San Jose meeting. I believe that
> the minutes show that we agreed that the MIME specification of the charset
> overrides SGML's specification.

That's basically what the text in the 02 draft,
Section 3.2, "Character Set Issues" says:

    Other values for the charset parameter may be defined by the
    transport mechanism (e.g., MIME and HTTP), but are not defined by
    this specification. Since the SGML declaration for HTML (supplied
    in Section 12.3) is only applicable to ISO-8859-1 and its subsets,
    a charset parameter that specifies a different character set must
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    also imply a different SGML declaration. Therefore, user agents may
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    use the charset parameter value to select a different declaration,
    even though the mechanism for doing so is not defined by this
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    specification. The intent, however, is that such a declaration be
    ^^^^^^^^^^^^^
    as identical as possible to that of Section 12.3, the only
    differences being those required to support the announced charset.

This is perfectly acceptable from an SGML processing point of
view. Most HTML user agents will never need to worry about the
SGML declaration, and can just use the MIME charset= parameter.
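As a rough illustration of the "just use the MIME charset= parameter"
case, here is a minimal Python sketch. The helper name
declared_charset and the ISO-8859-1 fallback are assumptions made for
the example; the draft does not define them in this form.

```python
from email.message import Message

def declared_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, lowercased.

    `default` is an assumed fallback for when no charset is announced.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)

print(declared_charset("text/html; charset=ISO-8859-5"))  # iso-8859-5
print(declared_charset("text/html"))                      # iso-8859-1
```

An agent that never touches an SGML declaration needs nothing more
than this value to decide how to decode the document's octets.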

Agents which *do* need an SGML declaration (and, one presumes,
are capable of handling it correctly) are the only ones which
need to worry about constructing or selecting one, and they
are instructed to do so in a way that's consistent with the MIME
interpretation.

The _problem_ is in section 6.3.2, "Character Octet References":

    The character octet references are not dependent on the character
    set encoding of the document. For example, "&#215;" always
    represents the ISO-8859-1 multiply sign, even when the document's
    declared character set is other than ISO-8859-1.

This directly contradicts section 3.2 and/or ISO 8879.

> This is a very simple issue, but very hard to choose. We will either be
> slightly incompatible with SGML, or we will be slightly incompatible with
> MIME.

[ "slightly incompatible" is an oxymoron :-) ]

There is a way to be compatible with both: Keep the
text in 3.2 as it stands (which is kosher from a MIME
point of view, right?) and change 6.3.2 to something
like (emphasis added):

    Numeric character references *are* dependent on the character
    set encoding of the document. For example, "&#215;" represents
    the multiply sign *if* the document's declared character
    set is ISO-8859-1.

This would make it SGML compliant as well.
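To make the difference between the two readings of 6.3.2 concrete,
here is a small Python sketch. The function resolve_numeric_refs and
its fixed_latin1 flag are hypothetical names invented for this
example, and it only handles single-octet codes (0..255); it is not
how any particular user agent works.

```python
import re

def resolve_numeric_refs(text, charset, fixed_latin1=False):
    """Replace &#N; (N in 0..255) with the character that code N denotes.

    fixed_latin1=False: the proposed rule, code N is looked up in the
                        document's declared charset.
    fixed_latin1=True:  the 02 draft wording, code N is always looked
                        up in ISO-8859-1.
    """
    table = "iso-8859-1" if fixed_latin1 else charset

    def repl(match):
        code = int(match.group(1))
        if code > 255:
            return match.group(0)  # outside this sketch's scope, leave as-is
        return bytes([code]).decode(table, errors="replace")

    return re.sub(r"&#(\d+);", repl, text)

# Declared charset ISO-8859-1: both rules give the multiply sign.
print(resolve_numeric_refs("2 &#215; 3", "iso-8859-1"))
# Declared charset ISO-8859-5: the proposed rule gives Cyrillic "ze",
# because 215 is that letter's code in ISO-8859-5.
print(resolve_numeric_refs("2 &#215; 3", "iso-8859-5"))
# The 02 draft wording still gives the multiply sign here.
print(resolve_numeric_refs("2 &#215; 3", "iso-8859-5", fixed_latin1=True))
```

Under the proposed wording the reference and a literal octet 215 in
the same document always agree, which is what an SGML parser expects.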

(Are MIME applications allowed to translate documents
from one character set to another in transit? That's
the only way I can think of that this change would
break MIME.)
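Here is a sketch of how such a translation could go wrong if
references depend on the declared charset. It assumes a hypothetical
gateway that re-encodes the octets from ISO-8859-5 to KOI8-R and
leaves the markup alone; the helper names are made up for the example,
and whether MIME permits this at all is exactly the open question.

```python
def meaning_of_ref(code, charset):
    """Character that numeric reference &#code; denotes under `charset`."""
    return bytes([code]).decode(charset)

def transcode(raw, src, dst):
    """Re-encode octets from charset src to dst, as a gateway might."""
    return raw.decode(src).encode(dst)

# A literal Cyrillic "ze" followed by a reference to the same letter
# (code 215 in ISO-8859-5).
body_5 = "\u0437 &#215;".encode("iso-8859-5")
body_koi = transcode(body_5, "iso-8859-5", "koi8-r")

literal_before = body_5.decode("iso-8859-5")[0]
literal_after = body_koi.decode("koi8-r")[0]
ref_before = meaning_of_ref(215, "iso-8859-5")
ref_after = meaning_of_ref(215, "koi8-r")

print(literal_before == literal_after)  # True: the literal survives
print(ref_before == ref_after)          # False: the reference now means "ve"
```

The reference text "&#215;" is plain ASCII and passes through
unchanged, so a translator would also have to rewrite the number for
the document to keep its meaning.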

--Joe English

joe@trystero.art.com