Re: New draft: charset, conformance cleanup

Roy T. Fielding (fielding@avron.ICS.UCI.EDU)
Mon, 3 Apr 95 08:49:37 EDT

Boy, Cynthia is going to be pissed at us. ;-)

> attribute
> A name/value pair: part of an element which is often used
> to specify a characteristic quality of the element, other than
> type or content.

Is that sufficient to cover minimized attributes (e.g. <UL COMPACT>)?

> character encoding
> A mapping from sequences of octets to sequences of characters
> from a character repertiore; that is, a sequence of octets and a
> character encoding determines a sequence of characters.

repertiore => repertoire

> entity
> A text entity, or some other data with an associated notation or
> interpretation; for example, a sequence of octets associated
> with an Internet Media Type.

That's a recursive definition. It would make more sense for "text entity"
to refer to this definition, rather than the other way around. I think
that "a sequence of octets with a defined format" is sufficient.

> MIME entity
> a head and body. The head is a collection of name/value fields,
> and the body is a sequence of octets. The head defines the
> content type and content transfer encoding of the body.

I would prefer "message entity", since HTTP uses it as well.

> minimally conforming HTML user agent
> A user agent that conforms to this specification in its
> treatment of the Internet Media Type "text/html; level=0;
> version=2.0"

Is this used? Is it necessary?

> SGML
> Standard Generalized Markup Language [12] (see also [9] and [6])
> is a system for describing docyment types and markup languages
> to represent them.

docyment => document

> 2. HTML as an Application of SGML
>
> HTML is an application of ISO Standard 8879:1986 -- Standard
> Generalized Markup Language (SGML) [12]. SGML is a system for
> defining structured document types and markup languages to
> represent instances of those document types. The SGML declaration
> for HTML and the HTML document type definitions (DTDs) are provided
> in Section 12.
>
> The term "HTML" refers to both the document type defined here and
> the markup language for representing instances of this document
> type.
>
> If this specification and the SGML standard conflict,
> the SGML standard is definitive.

As I've said, this statement is not appropriate. If it remains in the
document, I will ask that the IESG remove it on the grounds that it
introduces an unacceptable loophole in the definition of an Internet
standard. Moreover, it is completely unnecessary.

If you want to say something like this, the most you can say is:

If the description of SGML presented in this specification conflicts
with the SGML standard as represented by ISO 8879:1986 [12], then
the SGML standard is definitive.

Anything more and you might as well just say "HTML is SGML" and throw
away the rest of the spec. Naturally, the spec itself will then be ignored.

> 2.1 SGML Documents

very nice ....

> The start symbol of the DTD grammar is HTML, and the productions
> are given in the public text identified by "-//IETF//DTD HTML
> 2.0//EN", that is section 13.3@@. Hence the terminals abover parse as:

=>> 2.0//EN" (Section 13.3). Hence, the terminals above parse as:

> HTML
> |
> \-HEAD, BODY
> | |
> \-TITLE \-P
> | |
> | \-<P>,"Some text. ",EM
> | |
> | \-<EM>,"*wow*",</EM>
> \-<TITLE>,"Parsing Example",</TITLE>

Hmmmm, I don't grok this diagram (I know it's supposed to be a parse
tree, but what happened?).

> 2.2.1 Data Characters
>
> Any sequence of characters that do not constitute markup (see
> "Delimiter Recognition," section @@@ of the SGML standard) are
> mapped directly to strings of data characters. Some markup also
> maps to data character strings. Numeric character references also
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> map to single-character strings, via the document character
> set. Each reference to one of the general entities defined in the
                                    ^^^^^^^^^^^^^^^^
These need definitions to tie them to the syntax below.

> HTLM DTD also maps to a single-character string.
  ^^^^
>
> For example,
>
> abc&lt;def => "abc","<","def"
> abc&#60;def => "abc","<","def"
>
> Note that the terminating semicolon is only necessary when the
> character following the reference would otherwise be recognized as
> markup:
>
> abc &lt def => "abc ","<"," def"
> abc &#60 def => "abc ","<"," def"
>
> And note that an ampersand is only recognized as markup when it
> is followed by a letter or number:
>
> abc & lt def => "abc & lt def"
> abc & 60 def => "abc & 60 def"
>
> A useful technique for translating plain text to HTML is to replace
> each '<', '&', and '>' by an entity reference or numeric character
> reference as follows:
>
> ENTITY NUMERIC
> CHARACTER REFERENCE CHAR REf CHARACTER DESCRIPTION
====>                      ^^^

...
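
For what it's worth, the reference examples and the plain-text translation
quoted above can be sketched in a few lines of Python. This is only an
illustration: the function names are made up, and html.unescape follows
the much later HTML5 rules rather than an SGML parser, though as far as I
can tell it agrees with every example in the quoted text.

```python
import html


def plain_text_to_html(text: str) -> str:
    """Replace '&', '<' and '>' with entity references (hypothetical helper)."""
    # '&' is handled first so that the ampersands introduced by the other
    # two replacements are not escaped a second time.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;"))


def resolve_references(text: str) -> str:
    """Map entity and numeric character references to data characters."""
    # Assumption: HTML5 rules stand in for the SGML rules in the draft.
    return html.unescape(text)


if __name__ == "__main__":
    for sample in ("abc&lt;def", "abc&#60;def", "abc &lt def",
                   "abc &#60 def", "abc & lt def", "abc & 60 def"):
        print(repr(sample), "=>", repr(resolve_references(sample)))
```

The ordering in plain_text_to_html matters: escaping '<' before '&' would
turn "&lt;" into "&amp;lt;".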

> The length of an attribute value (not the attribute value literal:
> this is the result of stripping the quotes and replacing any
> references).is limited to 1024 characters
^^^
It is hard to tell what you mean here -- which one is referred to as "this"?

> 2.2.5 Comments
>
> To include comments in an HTML document that will be eliminated in
> the mapping to terminals, surround them with "<!--" and
> "-->". After the comment delimiter, all text up to the next
> occurrence of "-->" is ignored. Hence comments cannot be
> nested. White space is allowed between the closing "--" and ">",
> but not between the opening "<!" and "--".

Does this section need something to the effect that "--" is not allowed
inside the comment itself? I.e., to avoid having a true SGML parser barf
on one of <!----->, <!------>, <!-------> (or do they barf at all? My memory
may be lacking here).
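
To make the question concrete, here is a rough Python sketch of the rule as
I understand it from ISO 8879: a comment declaration is "<!", then zero or
more "-- text --" comments, each optionally followed by white space, then
">". The regular expression and the function name are my own and are an
assumption about the standard, not text taken from it.

```python
import re

# "<!" + zero or more ("--" text-without-"--" "--" optional-white-space) + ">"
_COMMENT_DECL = re.compile(r"<!(?:--(?:(?!--).)*--\s*)*>", re.DOTALL)


def is_sgml_comment_declaration(markup: str) -> bool:
    """Return True if the whole string reads as one comment declaration."""
    return _COMMENT_DECL.fullmatch(markup) is not None


if __name__ == "__main__":
    for sample in ("<!-- fine -->", "<!-- fine -- >", "<! -- bad -->",
                   "<!-- a -- b -->", "<!----->", "<!------>", "<!------->"):
        print(sample, is_sgml_comment_declaration(sample))
```

Under that reading all three of the dash-only strings are rejected, since
each leaves either a stray "-" or an unterminated second comment before
the ">".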

> 3.1 text/html media type
> ...
> Charset
> The charset parameter (as defined in section 7.1.1 of RFC
> 1521 [4]) may be given to specify the encoding used to represent
> the HTML document as a sequence of octets. The default value is
> out of scope of this specification; but for example, it is

==> outside the scope of this specification; but, for example, the
default is

> US-ASCII in the context of MIME mail, and ISO-8850-1 in the
> context of HTTP.
==> ISO-8859-1
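
A small Python sketch of how a recipient might pick the charset, using the
two context defaults named in the quoted text. The function name, the
"mail"/"http" context labels, and the lookup table are hypothetical; only
the two default values come from the draft.

```python
from email.message import Message

# Context defaults as stated in the quoted paragraph.
DEFAULT_CHARSET = {"mail": "us-ascii", "http": "iso-8859-1"}


def effective_charset(content_type: str, context: str) -> str:
    """Return the declared charset parameter, else the context default."""
    msg = Message()
    msg["Content-Type"] = content_type
    declared = msg.get_content_charset()  # lowercased, or None if absent
    return declared if declared is not None else DEFAULT_CHARSET[context]


if __name__ == "__main__":
    print(effective_charset("text/html", "http"))
    print(effective_charset("text/html", "mail"))
    print(effective_charset("text/html; charset=ISO-8859-1", "mail"))
```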

> 3.2.1 Conventional Handling of Undeclared Markup Errors
>
> NOTE: To facilitate experimentation and interoperability between
  ^^^^
nix the note style. If you want to apply the semantics of a note to an
entire section, just precede it by something like the last paragraph (below).

>...
> Information providers should keep in mind that this convention is
> not binding: unspecified behaviour may result, as such markup is
> not conforming to this specification.
>
>
> 3.2.1 Conventional Representation of Newlines and Record Delimiter Characters

Title is too long -- need to shorten (or wrap) it.

> 5.5 Meta
>
> HTML user agent to identify and make use of that metainformation.

==> agents

> 5.6 Nextid

> HTML user agentss may ignore the Nextid element. Support for the

==> agents
[must have been to make up for the missing one above ;-)]

> 6. Data Characters
>
> An HTML user agent should present the body of an HTML document as
> a collection of typeset paragraphs and preformatted text. Except
> for te PRE element, each block structuring element is regarded as
==> the
> a paragraph by taking the data characters in its content and the
> content of its descendent elements, concatenating them, and
> splitting the result into words, separated by space, tab, or
> record end characters (and perhaps hyphen characters). The
> sequence of words is typeset as a paragraph by breaking it into
> lines.
>
> 6.1 The ISO Latin 1 Character Repertiore

==> Repertoire

> Conforming HTML user agents are required to support the US-ASCII
> [10] or ISO-8859-1 [11] character encodings, and the @@fullname ISO
> Latin 1 document character set.

Huh? ISO-8859-1 [11] is the full name for the ISO Latin 1 character set.
According to IANA, the acceptable names for that character set in Internet
documentation include:

Name: ISO_8859-1:1987 [RFC1345,KXS2]
Source: ECMA registry
Alias: iso-ir-100
Alias: ISO_8859-1
Alias: ISO-8859-1
Alias: latin1
Alias: l1
Alias: IBM819
Alias: CP819

and ISO-8859-1 is what is normally used -- the full reference for
the standard is in [11], and does not need to be repeated. Use of the
term "Latin 1" is just confusing the issue.
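
As an illustration of why one registered name is enough, here is a short
Python sketch that checks whether the IANA names listed above resolve to a
single codec. It relies on Python's own alias table, which is an assumption
about one implementation and not part of the IANA registry; the helper name
is made up.

```python
import codecs

# Names copied from the IANA entry quoted above.
IANA_NAMES = ["ISO_8859-1:1987", "iso-ir-100", "ISO_8859-1", "ISO-8859-1",
              "latin1", "l1", "IBM819", "CP819"]


def canonical_codec(name):
    """Return Python's canonical codec name, or None if the name is unknown."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return None


if __name__ == "__main__":
    for name in IANA_NAMES:
        print(name, "->", canonical_codec(name))
```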

> The character repertiore shared by these two is known as Latin
==> oire

> Alphabet No. 1, or simply Latin-1. Latin-1 includes characters
> from most Western European languages, as well as a number of
> control characters. Latin-1 also includes a non-breaking space, a
> soft hyphen indicator, 93 graphical characters, 8 unassigned
> characters, and 25 control characters.
..
>
> Each character in the document character set can be written as a
> numeric character reference. This list, sorted numerically, is
> derived from ISO-8859-1 8-bit single-byte coded graphic character
> set:
>
> REFERENCE DESCRIPTION
> &#00; - &#08; Unused
..
This table should have remained in section 13 -- its presence in the
middle of the draft introduces too much clutter into the spec and
makes it difficult to read as a document.

> 7.1 Line Break
>
> <BR> Level 0
>
> The Line Break element specifies a line break in a paragraph or
> preformatted text section. A new line should indent the same as
> that of line- wrapped text.
^^^^

> 12.6 HTML Level 2 DTD
>
> <!ENTITY % HTTP-Method "NAME"

should be returned to

<!ENTITY % HTTP-Method "GET | POST"
or
<!ENTITY % HTTP-Method "GET | POST | PUT | DELETE | LINK | UNLINK"
========
> ...
> <!ENTITY % linkType "NAMES"
> -- a list of these will be specified at a later date -->

The above should be deleted, in favor of changing:

> <!ENTITY % linkExtraAttributes
> "REL %linkType #IMPLIED
> REV %linkType #IMPLIED
> URN CDATA #IMPLIED
> TITLE CDATA #IMPLIED
> METHODS NAMES #IMPLIED
> ">
to
<!ENTITY % linkExtraAttributes
"REL NAMES #IMPLIED
REV NAMES #IMPLIED
URN CDATA #IMPLIED
TITLE CDATA #IMPLIED
METHODS NAMES #IMPLIED
">

Personally, I'd like to delete METHODS as well -- I have never seen
it used at all, and know of no user agent that recognizes it.
But, I don't care much either way on this one.
========

> 14. Security Considerations
>
> Anchors, embedded images, and all other elements which contain URIs
> as parameters may cause the URI to be dereferenced in response to
> user input. In this case, the security considerations of the URI
> specification apply.
>
> Documents may be constructed whose visible contents mislead the
> reader to follow a link to unsuitable or offensive material.

One thing I'd like to see here (eventually) is a note about the
SGML <!NOTATION gif SYSTEM "/bin/rm -r /"> problem which may be a
security hole for net-clueless SGML processors (i.e. SGML systems
that are not aware they are receiving untrusted data). However, I
don't know enough about SGML to accurately describe the problem
(assuming one exists).

....Roy T. Fielding Department of ICS, University of California, Irvine USA
<fielding@ics.uci.edu>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>