Re: HTML/SGML/charsets

Roy T. Fielding (fielding@avron.ICS.UCI.EDU)
Fri, 31 Mar 95 16:01:46 EST

Evidently I need to clarify exactly what I changed.

> | > The old language was:
> | > 2.5 Understanding HTML and SGML
> | > HTML is an application of ISO Standard 8879:1986 -
> | > Standard Generalized Markup Language (SGML). SGML is a
> | > system for defining structured document types, and
> | > markup languages to represent instances of those
> | > document types. The SGML declaration for HTML is given
> | > in Section 5.1. It is implicit among HTML user agents.
> | >
> | > If the HTML specification and SGML standard conflict,
> | > the SGML standard is definitive.

That is an invalid statement for a standards-track IETF document.
If there is a discrepancy between the draft and SGML, it must be fixed,
unless, that is, ISO would like to place SGML under IETF version control.
If the SGML standard changes, it will not affect the HTML RFC until that
RFC is updated to reflect the change. That is why I deleted it,
and why it will stay deleted.

The only exception to this is in documents that describe the notation
used in the (eventual) RFC. For example, as an editor I don't have to
tell people what BNF is (though most people will do that in any case).

As it states quite clearly in the Introduction:

    This specification defines HTML as an application of ISO Standard
    8879:1986 Information Processing Text and Office Systems; Standard
    Generalized Markup Language (SGML). SGML provides a formal
    definition of the HTML syntax in the form of a Document Type
    Definition (DTD).

    This specification also defines HTML as an Internet Media Type [7]
    and MIME Content Type [4] called "text/html", or
    "text/html; version=2.0". As such, it defines the semantics of the
    HTML syntax and how that syntax should be interpreted by user agents.

There is no conflict between these two statements. The first requires
that HTML 2.0 be defined with an SGML-conformant syntax. The second
requires that the document explain its interpretation as an Internet
media type.
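
As a rough illustration of the second statement, a recipient could read
the media type and its version parameter from a Content-Type header
along the following lines. This is only a sketch in Python using the
standard library's email package; the helper name describe_media_type
is made up for the example and nothing here is taken from the draft.

```python
from email.message import Message


def describe_media_type(header_value):
    """Return (media_type, version) for a Content-Type header value.

    Hypothetical helper for illustration; version is None when the
    header carries no version parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type/subtype pair;
    # get_param() returns None when the parameter is absent.
    return msg.get_content_type(), msg.get_param("version")


print(describe_media_type("text/html"))
print(describe_media_type("text/html; version=2.0"))
```

Either form names the same media type; the version parameter only adds
information for a recipient that cares to look at it.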

> The use of SGML to encode HTML docs is an SGML app, and must be
> conformant or it is meaningless.

The syntax must be conformant SGML, yes. However, there is no requirement
whatsoever that HTML also support the million other features found in SGML.
HTML 2.0 is explicitly limited to the features defined in the draft spec.

> No, SGML Open is not a standards body. Apples and oranges. And
> we specifically dealt with character set issues by deciding that
> we wouldn't, for 2.0, and that we'd limit ourselves to 8859-1.
> None of this equivocation is necessary.

A large number of objections were made to that decision, quite vociferously
in some cases, and no satisfactory answer was given as to why 2.0 could
not accommodate more than just ISO-8859-1. I found a way to do this
in the specification while at the same time maintaining (what I thought was)
conformance to SGML.

The only place where this is not the case is in the treatment of &#NNN;
references, and that was due to a mistake on my part -- I did not know that
they were defined by the SGML standard to always refer to the declared
baseset, and thus mistook their purpose in the spec. That needs to be fixed,
and I think Francois' suggestion is sufficient.
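
To make the distinction concrete, here is a minimal sketch in Python of
the corrected reading, in which &#NNN; is resolved against the declared
base set rather than the charset used for transfer. It assumes the base
set is ISO 8859-1 and handles only decimal references; the function name
is invented for the example.

```python
import re

NUMERIC_REF = re.compile(r"&#([0-9]+);")


def resolve_numeric_refs(data, transfer_charset):
    """Decode bytes with the transfer charset, then expand &#NNN;.

    Assumption: the declared base set is ISO 8859-1, so NNN in 0..255
    is a Latin-1 code position regardless of transfer_charset.
    """
    text = data.decode(transfer_charset)

    def expand(match):
        code = int(match.group(1))
        if code > 255:
            # Outside the assumed base set; leave the reference as-is.
            return match.group(0)
        return bytes([code]).decode("iso-8859-1")

    return NUMERIC_REF.sub(expand, text)


# The raw byte 0xE9 depends on the transfer charset; &#233; does not.
print(resolve_numeric_refs(b"\xe9 &#233;", "iso-8859-1"))
print(resolve_numeric_refs(b"\xe9 &#233;", "iso-8859-5"))
```

The point of the two calls is that the literal byte changes meaning with
the charset parameter while the numeric reference stays fixed.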

> | User agents can (and in some cases, should) bend the rules of SGML
> | in order to provide maximally robust interface to the user. Quite frankly,
> | this is an area that Internet people have had more experience with than
> | SGML people, and I think SGML folks should learn from it just like we
> | have learned the benefits of formally-structured documents.
> If we do not define a conformant DTD, or if we set up a situation in
> which SGML tools will give a different result from HTML UAs when
> processing *valid* HTML (not talking about error recovery here),
> we will have failed to produce a valid HTML spec and will deserve
> all the calumny that will eventually come our way.

We can't have a non-conformant DTD, obviously. I have never suggested
that we should. The section you are referring to only applies when the
user agent receives something it *knows* will be non-conformant to our
HTML 2.0 DTD, because that is what it means when the charset is something
other than ISO-8859-1 or its subsets. Saying that the spec should ignore
that condition is unacceptable to me.

One thing we can do is limit the user agent's flexibility when the
text/html document does include an explicit <!DOCTYPE prologue. In other words,
only allow the user agent to selectively choose DTDs when it is already
doing so as part of the default behavior.
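
One way to picture that restriction, as a sketch only: the user agent
looks for an explicit <!DOCTYPE at the start of the document and, if it
finds one, takes the named public identifier at face value; otherwise it
falls back to whatever default selection it already performs. The
function, the regular expression, and the default identifier below are
assumptions for illustration, not anything specified in the draft.

```python
import re

# Matches a leading declaration of the form:
#   <!DOCTYPE HTML PUBLIC "...public identifier...">
DOCTYPE = re.compile(r'\s*<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)

# Assumed default; a real user agent would use its own built-in choice.
DEFAULT_DTD = "-//IETF//DTD HTML 2.0//EN"


def choose_dtd(document_text):
    """Return (public_id, may_substitute).

    may_substitute is False when the document names its DTD explicitly,
    and True when the user agent is already choosing one by default.
    """
    match = DOCTYPE.match(document_text)
    if match:
        return match.group(1), False
    return DEFAULT_DTD, True


print(choose_dtd('<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<title>x</title>'))
print(choose_dtd("<title>x</title>"))
```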

> When you speak of "achieving an maximally robust interface to the user"
> *that* sounds like error recovery to me.

Yes, it is recovery from the unexpected. There is precious little
difference between an extension and an error in SGML if the application
has a hardcoded DTD (as do 99.999% of all text/html user agents).

> I suggest breaking out all the UA stuff into a separate document. It
> seems only to be getting in the way of defining an SGML conformant
> interpretation of an incoming HTML doc, which we still haven't done yet.

Absolutely not.

....Roy T. Fielding Department of ICS, University of California, Irvine USA