Re: New draft: charset, conformance cleanup

Dan Connolly (connolly@w3.org)
Mon, 3 Apr 95 08:51:08 EDT

Roy T. Fielding writes:
> Boy, Cynthia is going to be pissed at us. ;-)

Who is Cynthia, and why?

> > attribute
> > A name/value pair: part of an element which is often used
> > to specify a characteristic quality of the element, other than
> > type or content.
>
> Is that sufficient to cover minimized attributes (e.g. <UL COMPACT>)?

Yes: <UL compact> is just short for <ul compact="compact">, so it still
specifies a name and a value.
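
To illustrate the point, here is a minimal sketch of that expansion. It assumes Python's standard html.parser module, which reports a bare attribute with a value of None; the names _StartTagCollector and expand_minimized are hypothetical and appear nowhere in the draft. (Strictly, SGML treats the lone token as the value with the name omitted; the sketch glosses over that.)

```python
from html.parser import HTMLParser


class _StartTagCollector(HTMLParser):
    """Collect (tag, attrs) for each start tag, expanding minimized attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports a bare attribute such as COMPACT with value None;
        # expand it to the name/value pair it abbreviates.
        expanded = [(name, name if value is None else value)
                    for name, value in attrs]
        self.tags.append((tag, expanded))


def expand_minimized(markup):
    """Return the start tags in markup with every attribute as a name/value pair."""
    parser = _StartTagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags
```

With this sketch, expand_minimized("<UL COMPACT>") and expand_minimized('<ul compact="compact">') should yield the same result, since html.parser lowercases tag and attribute names.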

> > minimally conforming HTML user agent
> > A user agent that conforms to this specification in its
> > treatment of the Internet Media Type "text/html; level=0;
> > version=2.0"
>
> Is this used? Is it necessary?

If this is not necessary, then none of the stuff about levels is
necessary. That may be the case, but I'm not interested in making
that change today.

> > 2. HTML as an Application of SGML
> > If this specification and the SGML standard conflict,
> > the SGML standard is definitive.
>
> As I've said, this statement is not appropriate.

So stricken. Just so we all understand: HTML is an SGML application,
and we're not in conflict with ISO 8879, to the best of our knowledge.

> If you want to say something like this, the most you can say is:
>
> If the description of SGML presented in this specification conflicts
> with the SGML standard as represented by ISO 8879:1986 [12], then
> the SGML standard is definitive.

I'd be willing to put this in, if anybody wants it there.

> > HTML
> > |
> > \-HEAD, BODY
>
>
> Hmmmm, I don't grok this diagram (I know it's supposed to be a parse
> tree, but what happened?).

Hmmm... maybe some tabs got lost in an ASCII mode ftp transfer
that I did...

> > 2.2.1 Data Characters
> >
> > Any sequence of characters that do not constitute markup (see
> > "Delimiter Recognition," section @@@ of the SGML standard) are
> > mapped directly to strings of data characters. Some markup also
> > maps to data character strings. Numeric character references also
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > map to single-character strings, via the document character
> > set. Each reference to one of the general entities defined in the
> ^^^^^^^^^^^^^^^^
> These need definitions to tie them to the syntax below.

Hmmm... how many definitions should I copy from the SGML standard?
(and should I go back and change the definitions of document, element,
attribute, entity, ... to match ISO8879?) I could separate the
definitions into those introduced by this spec, and those borrowed
from normative references. Would that be useful?

> > The length of an attribute value (not the attribute value literal:
> > this is the result of stripping the quotes and replacing any
> > references).is limited to 1024 characters
> ^^^
> It is hard to tell what you mean here -- which one is referred to as "this"?

Since this is informative only, and HTML attributes generally get
smaller rather than larger when you replace references, I changed it
to:

Note that the SGML declaration in section 13.3 limits the length of
an attribute value to 1024 characters.
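
For anyone who wants to check a value against that limit, a rough sketch follows. It assumes Python's html.unescape as a stand-in for reference replacement (it knows a larger entity set than HTML 2.0 defines), and the names LITLEN_LIMIT, attribute_value_length and within_limit are hypothetical:

```python
import html

# Limit quoted in the note above; the constant name is made up for this sketch.
LITLEN_LIMIT = 1024


def attribute_value_length(literal):
    """Length of the attribute value behind an attribute value literal."""
    # Strip one pair of matching quotes, if present.
    if len(literal) >= 2 and literal[0] == literal[-1] and literal[0] in "\"'":
        literal = literal[1:-1]
    # Replace character and entity references, then count what is left.
    return len(html.unescape(literal))


def within_limit(literal):
    """True if the value, after stripping quotes and replacing references, fits."""
    return attribute_value_length(literal) <= LITLEN_LIMIT
```

The value "a&amp;b" in quotes measures 3 characters here, not 9, which is why replacing references generally makes a value shorter.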

> > 2.2.5 Comments
> >
> Does this section need something to the effect of "--" is not allowed inside
> the comment itself? I.e., to avoid having a true SGML parser barf
> on one of <!----->, <!------>, <!-------> (or do they barf at all -- my memory
> may be lacking here).

I refer you to:
Re: <!-- comments -->
Daniel W. Connolly (connolly@hal.com)
Fri, 20 Jan 95 19:18:22 EST
http://www.acl.lanl.gov/HTML_WG/html-wg-95q1.messages/0189.html

For readers of the published spec, section 2 starts out with the
following apology:

A complete discussion of the mapping of a sequence of characters to
a sequence of tags and data is left to the SGML standard. This
section is only a summary.
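
For the impatient, a small sketch of the rule as I understand it: a comment declaration is "<!", then zero or more comments, each opened and closed by "--" and optionally followed by white space, then ">". The regular expression and the name is_valid_comment_declaration are my own illustration, not text from the spec or from ISO 8879:

```python
import re

# "<!" then zero or more "--...--" comments (each optionally followed by
# white space) then ">". A comment's text may not itself contain "--".
_COMMENT_DECL = re.compile(r"<!(?:--(?:(?!--).)*--\s*)*>", re.DOTALL)


def is_valid_comment_declaration(text):
    """True if text is, in full, a well-formed SGML comment declaration."""
    return _COMMENT_DECL.fullmatch(text) is not None
```

Under this reading, <!-----> , <!------> and <!-------> are all rejected, while <!--------> (two empty comments) is accepted.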

> > Conforming HTML user agents are required to support the US-ASCII
> > [10] or ISO-8859-1 [11] character encodings, and the @@fullname ISO
> > Latin 1 document character set.
>
> Huh? ISO-8859-1 [11] is the full name for the ISO Latin 1 character set.

The MIME thingy called ISO-8859-1 is a character encoding. User agents
must also support ISO Latin 1 as a document character set. The only
complete name I know for ISO Latin 1 as a document character set is:

    BASESET  "ISO 646:1983//CHARSET
              International Reference Version
              (IRV)//ESC 2/5 4/0"
    DESCSET    0   9  UNUSED
               9   2   9
              11   2  UNUSED
              13   1  13
              14  18  UNUSED
              32  95  32
             127   1  UNUSED
    BASESET  "ISO Registration Number 100//CHARSET
              ECMA-94 Right Part of
              Latin Alphabet Nr. 1//ESC 2/13 4/1"

    DESCSET  128  32  UNUSED
             160  96  32
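
To read the declaration above: each DESCSET row gives a starting document character number, a count, and either the matching starting number in the base set or UNUSED. A minimal sketch that expands the rows into a lookup table follows; the tuple layout and the names DESCSET_ISO646, DESCSET_LATIN1_RIGHT and build_charset are hypothetical, chosen only for illustration:

```python
# Each row: (document character number, count, base-set number or None for UNUSED).
DESCSET_ISO646 = [
    (0, 9, None), (9, 2, 9), (11, 2, None), (13, 1, 13),
    (14, 18, None), (32, 95, 32), (127, 1, None),
]
DESCSET_LATIN1_RIGHT = [
    (128, 32, None), (160, 96, 32),
]


def build_charset(*descsets):
    """Map each document character number to (base set index, base number) or None."""
    mapping = {}
    for base_index, descset in enumerate(descsets):
        for start, count, base in descset:
            for offset in range(count):
                mapping[start + offset] = (
                    None if base is None else (base_index, base + offset)
                )
    return mapping


CHARSET = build_charset(DESCSET_ISO646, DESCSET_LATIN1_RIGHT)
```

In this table CHARSET[160] is position 32 of the second base set, and CHARSET[0] and CHARSET[128] are None, i.e. unused.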

> >
> > REFERENCE DESCRIPTION
> > &#00; - &#08; Unused
> ...
> This table should have remained in section 13 -- its presence in the
> middle of the draft introduces too much clutter into the spec and
> makes it difficult to read as a document.

OK, but I had to change some section names, since numeric character
references have nothing to do with entity sets.

> One thing I'd like to see here (eventually) is a note about the
> SGML <!NOTATION gif SYSTEM "/bin/rm -r /"> problem which may be a
> security hole for net-clueless SGML processors (i.e. SGML systems
> that are not aware they are receiving untrusted data). However, I
> don't know enough about SGML to accurately describe the problem
> (assuming one exists).

Hmmm... this reminds me: the spec should make it clear that HTML user
agents are not required to support an internal declaration subset, and
hence <!NOTATION ...> and <!ENTITY ...> declarations should not appear
in "text/html; version=2.0" message entities.

For user agents that do support an internal declaration subset,
this is a very real security concern.
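
A cautious receiver could simply refuse documents that carry an internal declaration subset. The following is a rough sketch under that assumption, not a parser: it only looks for a "[" opening a subset inside a <!DOCTYPE ...> declaration, and the name has_internal_subset is hypothetical:

```python
import re

# "<!DOCTYPE", then any number of white-space-separated tokens (quoted
# literals or bare names), then the "[" that opens an internal subset.
_INTERNAL_SUBSET = re.compile(
    r"""<!DOCTYPE(?:\s+(?:"[^"]*"|'[^']*'|[^\s"'\[>]+))*\s*\[""",
    re.IGNORECASE,
)


def has_internal_subset(document):
    """True if the document's DOCTYPE declaration opens an internal subset."""
    return _INTERNAL_SUBSET.search(document) is not None
```

A document starting with <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN"> passes, while one starting with <!DOCTYPE HTML [ <!ENTITY ...> ]> is flagged.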

[I followed the remaining editorial suggestions, and I won't quote them here.]

Daniel W. Connolly "We believe in the interconnectedness of all things"
Research Technical Staff, MIT/W3C
<connolly@w3.org> http://www.w3.org/hypertext/WWW/People/Connolly