Re: HTML 2.0 Spec questions

Daniel W. Connolly (connolly@hal.com)
Mon, 21 Nov 94 13:25:33 EST

> I picked up a printed copy of the HTML 2.0 proposed Spec from the Spyglass
>booth and had some questions on it.

I'll Say!

> I thought that you may have time to
>address them before the IETF meeting in December... If these should go to
>the html-wg mailing list, my apologies, I'll resend this.

I took the liberty of resending it for you.

>Ok, I thought there were only a few issues but I decided to go thru my little
>list and it seemed to grow as I typed... If you could address some of the
>larger questions (like how to implicitly decide to terminate a <LI> or
><OPTION>) I would really appreciate it. I hope some of the nits Ive
>indicated are of use in helping polish the spec (which has taken tremendous
>effort to get to this point I think).

OK... first, some general observations:

When I embarked on a clean-up of the HTML spec (WAY back in Jan 1993,
and again in May of this year), I intended to publish it as (1) a note
that says "HTML is defined in terms of SGML. Go read the SGML spec
first!" (2) the HTML DTD, and (3) a (very) few "application
conventions" like "the href attribute of A should be a URL" (since the
DTD can only say that it should be CDATA).

No suggested rendering, no HTTP interactions, no references to a WWW
browser at all. A nice, tidy, tractable, concise, .... useless document.

Well, I didn't think it was useless: I thought it could be used
to (1) test a document for syntactic validity, and (2) specify
a parse tree for syntactically valid documents. What you do with
the parsed HTML document, I wanted to leave to other documents,
like a browser spec, an HTTP spec, etc.

But TimBL et al. vetoed that idea, and insisted that each element
be described in prose with a little "suggested rendering" section,
examples, blah blah blah.

In the absence of a browser spec, I guess this is necessary. I still
find it odious. In fact, I have removed myself from the maintenance of
all that fluff. I maintain the DTD, and I do lots of testing. And I
maintain a validation service, because I find that "try it and see" is
the best way to learn the weirdness of SGML.

I applaud the writers for adding all the introductory and informative
material to make the spec accessible. But this thing has become quite a
beast to maintain and edit. I still maintain that all that fluff should
have been put in a separate document so that the technical content of
the HTML 2.0 spec could have been published back in June like I had
intended. Not that it would have been final -- we'd be on version 2.3
or 2.4 by now.

Your comments come in three flavors, which I'll call editorial (consistency,
phrasing, elaboration, organization), technical (is ___ legal or not?),
and I-wish-it-were-out-of-scope (HTTP interaction).

The editorial stuff I'll leave for the writers.

I'll try to answer the technical questions, but mostly I'll show you
how the validation service* (i.e. sgmls) can be used to answer these
questions.

* http://www.hal.com/%7Econnolly/html-test/service/validation-form.html

I'll hazard guesses at the out-of-scope stuff. I hope the browser
implementors will chime in.

>First off, there are some places that occasionally imply that a begin and end
>marker for some elements are needed but the DTD (or other parts of the spec)
>implies that only a begin marker is necessary. This can be confusing if you
>don't read the entire spec. For example, under Section 2.6 HTML Forms, the
>OPTION element appears to be defined as requiring the format
><OPTION>...</OPTION> but the DTD in Section 7.1 says that the </OPTION> is
>optional.

2.6 is wrong:

HTML Validation Service Response
********************************

$Id: html-check.pl,v 1.6 1994/11/14 20:55:09 markg Exp $

Check Complete
++++++++++++++

No errors found.

Input
+++++

<TITLE>option test</TITLE>
<form action="junk">
<select name="abc">
<option>1
<option>2
</select>
</form>

Parsed Output (Element Structure Information Set)
+++++++++++++++++++++++++++++++++++++++++++++++++

AVERSION CDATA -//IETF//DTD HTML//EN//2.0
(HTML
(HEAD
(TITLE
-option test
)TITLE
)HEAD
(BODY
AACTION CDATA junk
AMETHOD TOKEN GET
AENCTYPE CDATA application/x-www-form-urlencoded
(FORM
ANAME CDATA abc
ASIZE IMPLIED
AMULTIPLE IMPLIED
(SELECT
ASELECTED IMPLIED
AVALUE IMPLIED
(OPTION
-1
)OPTION
ASELECTED IMPLIED
AVALUE IMPLIED
(OPTION
-2
)OPTION
)SELECT
)FORM
)BODY
)HTML
C
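
If you want to post-process output like the above rather than read it by
eye, here is a minimal Python sketch that turns ESIS lines into a nested
structure. The name parse_esis and the dict layout are illustrative
assumptions, not part of sgmls. It handles only the line types that appear
in this message ("A" attribute, "(" open, ")" close, "-" data, "C"
conforming) and does not decode escapes such as \n in data lines.

```python
def parse_esis(text):
    """Build a nested tree from a subset of sgmls ESIS output.

    Returns (top_level_nodes, conforming). Each element is a dict with
    "name", "attrs" and "children"; character data is stored as str.
    """
    root = {"name": None, "attrs": {}, "children": []}
    stack = [root]
    pending = {}        # attribute lines precede the element they belong to
    conforming = False
    for line in text.splitlines():
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "A":
            parts = rest.split(" ", 2)      # name, type, optional value
            # IMPLIED attributes carry no value
            pending[parts[0]] = parts[2] if len(parts) > 2 else None
        elif kind == "(":
            node = {"name": rest, "attrs": pending, "children": []}
            pending = {}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == ")":
            closed = stack.pop()
            if closed["name"] != rest:
                raise ValueError("mismatched close: " + rest)
        elif kind == "-":
            stack[-1]["children"].append(rest)
        elif kind == "C":
            conforming = True   # sgmls prints C when the document is valid
    return root["children"], conforming
```

Feeding it the parsed output above should give one HTML node whose BODY
contains FORM, SELECT and two OPTION nodes, each with its end implied by
the parser rather than written in the input.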

> For the INPUT marker there is a </INPUT> in 2.6 but not in the
>DTD. The example for the paragraph marker under Section 2.2 correctly has
><P>...[</P>] which conforms to the DTD in Section 7.1.

INPUT is an empty element. There is no </input>.

>For the ISINDEX element, is it necessary to 'urlencode' the users input or
>does converting any spaces to plus signs suffice?

I would expect you should urlencode it. I wish this were outside
the scope of the HTML 2.0 spec.
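
For what it's worth, a rough sketch of the difference, in Python. It
assumes the ISINDEX input is appended to the document's URL as a query
string, with spaces sent as plus signs and other reserved characters sent
as %XX escapes. The helper name isindex_query and the example.org URL are
made up for illustration; quote_plus is just a convenient stand-in for
the escaping, not something the spec names.

```python
from urllib.parse import quote_plus


def isindex_query(base_url, user_input):
    """Append user input to base_url as an escaped query string."""
    # quote_plus turns spaces into "+" and escapes characters such as
    # "&" and "%" that would otherwise be misread by the server.
    return base_url + "?" + quote_plus(user_input)


print(isindex_query("http://example.org/search", "cats & dogs"))
# http://example.org/search?cats+%26+dogs
```

Converting only the spaces would leave the "&" in place, which is why
plus-for-space alone does not suffice.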

>What does the anchor element mean if there are no attributes?

Nothing. But as Murray might say, "No harm, no foul."

> If I have no
>hypertext link outbound or inbound, whats its purpose in life?

To illustrate that SGML attribute specifications are limited: you
can't say, in SGML, "at least one of name, href must be specified."
So <a> is legal as far as SGML is concerned, but meaningless as far
as HTML is concerned. This is one of the few "application conventions"
that actually belong in a "document type definition" such as the
HTML spec.
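
Since the DTD can't express this, a check has to live outside the SGML
parser. A minimal sketch in Python, using the standard html.parser module
as a stand-in tokenizer (it is not an SGML parser and does no validation);
the class and function names are hypothetical:

```python
from html.parser import HTMLParser


class BareAnchorFinder(HTMLParser):
    """Record the position of every <a> that has neither NAME nor HREF."""

    def __init__(self):
        super().__init__()
        self.bare = []      # (line, column) pairs

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "a":
            names = {name for name, _ in attrs}
            if not names & {"name", "href"}:
                self.bare.append(self.getpos())


def find_bare_anchors(html_text):
    finder = BareAnchorFinder()
    finder.feed(html_text)
    finder.close()
    return finder.bare
```

Running find_bare_anchors over a document reports the anchors that are
legal SGML but carry no meaning as HTML.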

>In section 3.6.3 when discussing the anchor element, a clear definition of
>how the presence of BOTH an HREF and a URN entry should be handled would be
>very useful. If there is both an HREF and a URN, which do you use?? (The
>same could be said for LINK in section 3.5.4).

I'd say try URN first. Again... this would be out of scope, if I had
my druthers.

>Just how do EM and STRONG differ in their intent?

They derive from TeXinfo:

=================
File: texi.info, Node: emph & strong, Next: Smallcaps, Up: Emphasis

`@emph'{TEXT} and `@strong'{TEXT}
---------------------------------

The `@emph' and `@strong' commands are for emphasis; `@strong' is
stronger. In printed output, `@emph' produces *italics* and `@strong'
produces *bold*.

For example,

@quotation
@strong{Caution:} @code{rm * .[^.]*} removes @emph{all}
files in the directory.
@end quotation

produces:

*Caution*: `rm * .[^.]*' removes *all*
files in the directory.

The `@strong' command is seldom used except to mark what is, in
effect, a typographical element, such as the word `Caution' in the
preceding example.

In the Info file, both `@emph' and `@strong' put asterisks around
the text.

*Caution:* Do not use `@emph' or `@strong' with the word `Note';
Info will mistake the combination for a cross reference. Use a
phrase such as *Please note* or *Caution* instead
=================

> The section on
>character-level elements says they should be rendered differently but I don't
>think the distinction between these elements is clear (at least to my
>fatigued mind).

The spec says that EM must be rendered distinctly from plain text,
and that STRONG must be rendered distinctly from plain text. It does
_not_ say that EM must be rendered distinctly from STRONG.

>If DFN is a proposed element (3.8.3), it should be described in more detail
>(ie: what does "the defining instance of a term" mean). Since its not in the
>DTD, a clear description (or at least external reference) would be useful.

Yes... we should explicitly cite TeXinfo.

>For a KBD element, just when is this user text supposed to be obtained? If
>there is an ISINDEX element in the same document then how is the ISINDEX
>input distinguished from the KBD input?
>
>Is there a recommendation how SAMP elements (section 3.8.6) should be
>rendered (like there is for most other elements in section 3.8)?
>
>How is a VAR element to be used? Having a 'variable' is fine but I don't
>grok how its to be used (especially if its classified as an information type
>element). Any examples would be helpful.
>
>Since 'proposed' elements are not listed in the summary sections of 2.x, the
>U element should be removed from 2.4 HTML Highlighting (since STRIKE and DFN
>are not there; just to be consistant).

See the TeXinfo document for answers to all of the above.

>There are no Level classifications for the entries in sections 3.8 or 3.9
>like there are for other elements in prior sections. The classification in
>section 3.7 says Level 2 for some character level elements but the table 2.4
>calls them all Level 1.

See:
http://www.hal.com/%7Econnolly/html-spec/html-pubtext.html
and the links to
L0index.html, L1index.html, L2index.html

for machine-generated listings of the elements at various levels, derived
from the DTD source.

>For the IMG element, does the ALIGN attribute only specify the alignment of
>the image with the text on both sides or just the text that follows it? IE:
>If an image is in the middle of a line with the attribute ALIGN=TOP, should
>all the text on both sides be top aligned or simply the text that follows it
>(ala Mosaic)?

Who cares? Er... "out of scope." Er... Corprew? How does this work
in the real world?

>The description of ISMAP under 3.10 should state that it is an optional
>attribute. Should this attributes description specify the behavior a
>renderer/client should have with this IMG (ie: Describe what is really meant
>by "can navigate transparently from one information resource to another.")

No! Er.. well, until there's a browser spec, yes. The Spyglass folks
have more experience building browsers than I do. I nominate them
to come up with the prose.

>Under 3.11.1, Definition Lists, it should be explicitly stated that in
>addition to "Single occurrences of a <DT> tag without a subsequent <DD>
>tag..." that the same can be said of <DD> markers (ie: "Single occurrences of
>a <DD> tag without a preceding <DT> tag are allowed, and have the same
>significance as if the <DT> tag had been present with no text.")

The current content model is (DT|DD)+.
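
Read that as "one or more children, each of which is either DT or DD, in
any order." So a lone DD is fine and so is a lone DT, but an empty DL is
not. A small Python sketch of that reading (the function name is
hypothetical, and it looks only at element names, not at what the
children themselves contain):

```python
def dl_children_valid(child_names):
    """True if the child element names satisfy the model (DT|DD)+."""
    # "+" means at least one child; "|" means each child is DT or DD.
    names = [name.upper() for name in child_names]
    return len(names) > 0 and all(name in ("DT", "DD") for name in names)


print(dl_children_valid(["DD"]))              # True: DD with no DT
print(dl_children_valid(["DT", "DT", "DD"]))  # True: order is free
print(dl_children_valid([]))                  # False: at least one needed
```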

>In general, does a non-character level marker (ie: <P>) automatically signfy
>the end of a DT, DL, or LI??

DT: yes; DD, LI: no (DD, LI can contain P and other %block elts.)

> IE: does
><DL>
><DT>Term1</DT><DD>This is the definition of the first term
><DT>Term2</DT><DD>This is the definition of the second term
></DL>
>get treated the same as say:
><DL>
><DT>Term1</DT><DD>This is the definition of the first term</DD>
><DT>Term2</DT><DD>This is the definition of the second term</DD>
></DL>

Yes:

Input
+++++

<TITLE>dl test</TITLE>
IE: does
<DL>
<DT>Term1</DT><DD>This is the definition of the first term
<DT>Term2</DT><DD>This is the definition of the second term
</DL>
get treated the same as say:
<DL>
<DT>Term1</DT><DD>This is the definition of the first term</DD>
<DT>Term2</DT><DD>This is the definition of the second term</DD>
</DL>

Parsed Output (Element Structure Information Set)
+++++++++++++++++++++++++++++++++++++++++++++++++

AVERSION CDATA -//IETF//DTD HTML//EN//2.0
(HTML
(HEAD
(TITLE
-dl test
)TITLE
)HEAD
(BODY
-IE: does\n
ACOMPACT IMPLIED
(DL
(DT
-Term1
)DT
(DD
-This is the definition of the first term
)DD
(DT
-Term2
)DT
(DD
-This is the definition of the second term
)DD
)DL
-\nget treated the same as say:\n
ACOMPACT IMPLIED
(DL
(DT
-Term1
)DT
(DD
-This is the definition of the first term
)DD
(DT
-Term2
)DT
(DD
-This is the definition of the second term
)DD
)DL
-\n\n
)BODY
)HTML
C

>Should the description of the Paragraph marker in 3.12.1 explicity state that
>all multiple occurrances of user inserted whitespace (ie
>"<H1>Hello</H1><P><P><P><P>There is a big gap between this and the header"
>would be rendered as if only one <P> were in the HTML stream??)?

Out of scope. Er..
I'm pretty sure that would be inaccurate w.r.t. current practice.

> Also,
>shouldn't the description of <P> urge authors to enclose the paragraph in
><P>...</P> rather than passively encouraging a variation of the older
>"...text<P>"? If the spec encouraged the use of <P>...</P> then it is more
>likely to be adopted in common use.

The examples should be updated to encourage
<p> ....

There's no reason to encourage folks to add </p> to their documents.
It's _always_ redundant.

>Given the FORMs example snippet of "<P ALIGN=CENTER><INPUT TYPE=SUBMIT><INPUT
>TYPE=RESET></FORM>", just how should the controls/text be rendered (where
>does the paragraph alignment automagically stop, at the next <P> element or
>at the </FORM> element?) I realize that the rendering rules are not
>strictly part of the HTML DTD but what I think would be extremely helpful is
>some rules / guidelines on when elements that do not REQUIRE their end
>markers (ie <P>, <LI>, <DT>, <OPTION>, etc) lose their 'effectiveness' (or
>'control' or 'focus', whatever you want to call it).

I hate this. I hate it when folks get confused about the distinction
between SGML parsing of an HTML document and the subsequent
rendering/processing. I wish this document covered 100% of the former,
and 0% of the latter, just so folks wouldn't get confused. SGML parsing
is a science. It's a weird science, but it's well specified and understood
by a certain community. HTML rendering is an art; it cannot and should
not be specified.

Please use the validation service and the various references generated
from the DTD to get a feel for how SGML tag inference works.

> One example could be "a
><LI> element does not require a corresponding </LI> element but if another
><LI> is encountered before a </LI> then the render should consider there to
>be an implicit preceding </LI>."

These rules are in the SGML spec, ISO8879. The TEI project publishes
a "Gentle Introduction to SGML" that seems to cover these issues to
some folks' satisfaction. Perhaps the HTML spec should cite that work.

It would be a painful process to reproduce the rules in the HTML spec
(painful because you'd want to omit any complications introduced by
variant syntaxes, CONCUR, RS/RE processing, etc.).
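
To give a feel for the simplest case only, here is a deliberately crude
Python sketch. It is NOT how an SGML parser works: a real parser derives
omitted tags from the DTD's content models, whereas this uses a small
hand-written table (CLOSED_BY and CONTAINERS are my own assumptions) and
only notices an open item when it is the innermost open element. It will
get nested paragraphs, empty elements like INPUT, and anything unusual
wrong.

```python
import re

# Hypothetical table: opening the key closes the innermost open element
# if that element is one of the listed names.
CLOSED_BY = {
    "li": {"li"},
    "dt": {"dt", "dd"},
    "dd": {"dt", "dd"},
    "option": {"option"},
}
# Closing one of these containers also closes an item left open inside it.
CONTAINERS = {
    "ul": {"li"},
    "ol": {"li"},
    "dl": {"dt", "dd"},
    "select": {"option"},
}

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")


def insert_omitted_end_tags(html_text):
    """Return html_text with some omitted end tags written out."""
    out, stack, pos = [], [], 0
    for m in TAG.finditer(html_text):
        out.append(html_text[pos:m.start()])    # text before this tag
        pos = m.end()
        closing, name = m.group(1) == "/", m.group(2).lower()
        if not closing:
            if stack and stack[-1] in CLOSED_BY.get(name, ()):
                out.append("</%s>" % stack.pop())
            stack.append(name)
        else:
            if stack and stack[-1] != name and stack[-1] in CONTAINERS.get(name, ()):
                out.append("</%s>" % stack.pop())
            if stack and stack[-1] == name:
                stack.pop()
        out.append(m.group(0))
    out.append(html_text[pos:])
    return "".join(out)


print(insert_omitted_end_tags("<ul><li>a<li>b</ul>"))
# <ul><li>a</li><li>b</li></ul>
```

The validation service remains the way to find out what the real rules
do with a given document.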

>In a form where there are both SUBMIT and RESET buttons, what should the
>behavior of the render when Enter is pressed? Would it be appropriate to
>default to SUBMIT (or RESET)?

Out of scope... Er.. sure. Sounds good.

> Would an attribute/value of "DEFAULT" for
>either be a useful way to set this??

DEFINITELY out of scope. We're _not_ introducing any features at
this point.

>Under the FORM description in 3.13.3, there is mention of encoding
>(application/x-www-form-urlencoded) but there is no description of this
>encoding anywhere in the HTML 2.0 spec (that I can find). This encoding
>should be explicitly described in this section under the ENCTYPE attribute
>(or a specific reference given since this is probably beyond the HTML spec
>and in the HTTP spec).

Now you're catching on!
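
Until that reference exists, a rough illustration of the encoding as
commonly practiced: name=value pairs joined by "&", spaces as "+", other
reserved characters as %XX. This Python sketch uses the standard
library's urlencode as a stand-in; the helper name encode_form and the
field names are made up for the example, and none of this is normative.

```python
from urllib.parse import urlencode


def encode_form(fields):
    """Encode (name, value) pairs as application/x-www-form-urlencoded."""
    # A list of pairs keeps the fields in document order.
    return urlencode(fields)


print(encode_form([("name", "Dan Connolly"), ("topic", "HTML 2.0 & SGML")]))
# name=Dan+Connolly&topic=HTML+2.0+%26+SGML
```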

>All INPUT attributes should be labeled as optional since according to the DTD
>none are required. JOOC, what would an INPUT field with no attributes be
>used for and what would be an example of how it should be rendered??

Hmmm... isn't the NAME attribute required? Sure enough... I wonder
if this should be changed???

>Given the following snippets, what would be the name/value pairs to be sent
>when the user selected the Submit button (there is no VALUE entry for the
>second example below)?
><INPUT NAME="Foo" VALUE="Yes" TYPE=SUBMIT>
><INPUT NAME="Bar" TYPE=SUBMIT>

Out of scope. Err...

I dunno. In these cases, I usually ask Corprew Reed to do some experiments
and report what Mosaic, WinWeb, MacMosaic, Netscape, etc. do. If they're
consistent, we write it up and call it "current practice." If not,
we say it's unspecified.

>I recall reading in some HTML spec somewhere that multiple whitespace would
>be automatically compressed into one. In the description of Space, 3.14.1,
>there is no mention of this apart from the fact that Space is a special
>character. Has this 'compression' feature been removed?

In a way... it's no longer specified.

>In Section 7.1, the Level 2 Recommended DTD, the Required Parts entry for
>BLOCKQUOTE simply has the begin and end marker back to back without any '...'
>or 'characters...' between them (like other elements do). Is there a reason
>for this or was it accidental?

The reason is that in the Recommended version of the DTD, Blockquote
is supposed to contain other elements, not raw data. You're
supposed to write:

<Blockquote>
<p>text
<ul><li>and<li>a<li>list</ul>
</blockquote>
rather than:
<Blockquote>
text
<ul><li>and<li>a<li>list</ul>
</blockquote>

The idea is that mixing block elements and character-level elements
is "messy."
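
If you want to find the "messy" cases in existing documents, something
like the following Python sketch would do as a first pass. It uses the
standard html.parser module as a tokenizer, not an SGML parser, and keeps
its own crude stack of open elements. The BLOCK set is a partial,
hand-picked list and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Partial, hand-picked list of block elements that end an open paragraph.
BLOCK = {"p", "ul", "ol", "dl", "pre", "blockquote"}


class BlockquoteTextFinder(HTMLParser):
    """Collect text that sits directly inside BLOCKQUOTE."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.loose_text = []

    def handle_starttag(self, tag, attrs):
        # A new block element ends a paragraph whose </p> was omitted.
        if tag in BLOCK and self.stack and self.stack[-1] == "p":
            self.stack.pop()
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element; end tags for items such
        # as <li> and <p> may have been omitted.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] == "blockquote":
            self.loose_text.append(text)


def loose_blockquote_text(html_text):
    finder = BlockquoteTextFinder()
    finder.feed(html_text)
    finder.close()
    return finder.loose_text
```

On the two examples above, the first should yield an empty list and the
second should report the stray "text".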

>In Section 7.1, there is no DTD entry for Underline. Was this simply because
>Mosaic can't handle it or was it depreicated in the HTML spec?

Some browsers use underlining as a way of indicating links. Hence,
if you tried to use underlining for something else, you'd lose. So
we thought it would be a bad idea to have an underline element.

Dan