Re: HTML todo list

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9301142323.AA26924@pixel.convex.com>
To: timbl@nxoc01.cern.ch
Cc: www-talk@nxoc01.cern.ch
Subject: Re: HTML todo list 
In-reply-to: Your message of "Thu, 14 Jan 93 18:02:07 +0100."
             <9301141702.AA00591@www3.cern.ch> 
Date: Thu, 14 Jan 93 17:23:44 CST
From: Dan Connolly <connolly@pixel.convex.com>

Issues 1-10 are resolved to my satisfaction.

>> 11. This text seems out of place:
>OK I have hidden it. :-) Does your spec say it anywhere?

Well, it's mentioned in tolerated.html, but that's hardly
the place to go looking for explanations. I suppose we
need a section telling implementors how to handle errors
or something.

12-16: OK.

>> 17. Should the TITLE element be CDATA, RCDATA, or PCDATA?
>> If we want to be able to use Latin chars in the title,
>> it can't be CDATA. The only difference between RCDATA
>> and PCDATA (with no subelements allowed) is that comments
>> are recognized in PCDATA, whereas they are just regular
>> data in RCDATA.
>
>Good point.
>
> - If we specify Latin 1 as the base set, can't we have Latin 1
>   characters in CDATA?
>
> - If we can't, then I guess we use PCDATA as it would be the
>   only place except for <XMP> and <LISTING> where we can use
>   RCDATA.
>   

I agree that CDATA is closest to the original intent. And yes,
if we include Latin 1 chars in the document character set, we
can put them in CDATA, but they'd have to be actual 8 bit
characters, and not numeric character references.

Furthermore, it presents a problem for HTML writers. It's convenient
to treat all data characters the same way, that is, replace <, >, &
with character references. If we make the TITLE CDATA, then HTML
writers must special case this element, and not do the replacement.
Furthermore, they're out of luck if they want to write "</x" where
x is any letter.
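
As a rough illustration of the "treat all data characters the same way"
rule, here is a minimal Python sketch. The function name is made up for
this example and is not part of any existing HTML library. It only
applies where references are recognized (RCDATA or PCDATA content), so
it would not be usable for a CDATA TITLE.

```python
# Hypothetical helper for illustration; not taken from any WWW library.
# Decimal codes: '<' is 60, '>' is 62, '&' is 38.
_REFS = {"<": "&#60;", ">": "&#62;", "&": "&#38;"}

def escape_data(text):
    """Replace <, > and & in data characters with numeric character references."""
    return "".join(_REFS.get(ch, ch) for ch in text)

print(escape_data("AT&T </x> notes"))
```

With this rule the writer never has to know which element the data ends
up in, which is the convenience argued for above.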

On the flip side, if they do replace <>& with references, old parsers
probably won't grok. But I think that's a small price to pay: titles
with <>& in them will look ugly on old implementations. No big deal.

So, in the end, I vote for RCDATA.

This brings me to issue
79. How do we maintain the DTD?
The annotated DTD is a great idea. But we need to be sure the
actual SGML code matches the hypertext version. It seems we can
derive the SGML from the hypertext with:
% www -n -source HTML.dtd.html >HTML.dtd
Then we just have to maintain the hypertext version. I'll do some
testing and get back to you.

Also, 
80. The hypertext version of the DTD has some non-PRE parts. I suspect
a NeXT browser bug.

18: OK, but the discussion relates to issue 50.

>Yes.  OK.  But I want as I said before (unless the crash lost the
>message) to have two documents out of this. One is the HTML spec for  
>MIME IANA registration.

I disagree. I still think the root of the HTML documentation should
look like:

<H1>HyperText Markup Language</H1>
<H3>Abstract</H3>...
<H2>Language Reference</H2>
        <A>Text and Markup</A>
        <A>The Elements</A>
        <A>Implementors' Guide</A>
<H2>Specification</H2>
        <A>the DTD</A>
<H2>Appendices</H2>
        <A>futures</A>
        <A>constraints</A>

> The other is a readable document which is  
>NOT 100% a precise reference document but can be read by human beings  
>WITHOUT SGML knowledge.  I can guess that this document will have 10  
>times the readership of the other if it is readable, as <10% of the  
>people creating HTML will know about SGML CROs etc etc.

I think this is madness. As I said before: it is exactly this group of
people, those writing HTML, that _must_ be aware of SGML syntax. The
"Text and Markup" document is about two pages of fairly readable
prose, in my opinion. If it's too much for folks to digest, let's do
something about that. But let's not create a priesthood here!

The whole point of my efforts is to make the HTML format
understandable by everyone who needs to understand it, while
maintaining SGML compliance.


>> In http://info.cern.ch/hypertext/WWW/MarkUp/Elements/NEXTID.html
>> 19. The status of each element should be noted consistently. e.g.

We already have some understanding of what the status is. What I'm
after is a way to look up the status of an element by name. This
relates to other reference needs too:

81. We need an alphabetical index of the HTML elements. Perhaps we
should reorganize the DTD so they appear in order there? I thought
http://info.cern.ch/hypertext/WWW/MarkUp/Tags.html
was going to be an alphabetical element reference. But I see the
need for the present organization. Still, I think we need another
element reference node.

back to
58. Get rid of NEXTID?
>I have made NEXTID Mainstream.  Editors need it: can't do without it
>really.

Hardly. It's a redundant optimization only: the editor need simply
scan all the anchor names, choose the highest, and add one.

Besides: it can be wrong. Suppose you create a document on the NeXT
and save it with NEXTID N=27. Then I edit it by hand and add anchor
names 27, 28, 29, and 30, but I forget to change NEXTID. Then
when you load it back into the NeXT editor, we'll have problems.
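
The "scan, choose the highest, add one" approach can be sketched in a
few lines of Python. This is only a sketch under stated assumptions:
anchor names are taken to be an optional letter prefix followed by
digits (like "27" or "z27"), and the regular expression is a crude
stand-in for a real parser.

```python
import re

# Assumption: anchor names look like an optional letter prefix plus digits.
# The pattern is a rough approximation, not a real SGML parser.
_NAME_ATTR = re.compile(
    r'<A\b[^>]*\bNAME\s*=\s*"?([A-Za-z]*)(\d+)"?', re.IGNORECASE
)

def next_anchor_number(html):
    """Return one more than the highest number used in any anchor name."""
    numbers = [int(m.group(2)) for m in _NAME_ATTR.finditer(html)]
    return max(numbers, default=0) + 1

doc = '<A NAME="27" HREF="x">a</A> <A NAME="30">b</A> <A HREF="y" NAME="z5">c</A>'
print(next_anchor_number(doc))
```

Because the number is recomputed from the anchors actually present, a
hand edit that adds names 27 through 30 cannot leave it stale the way a
stored NEXTID can.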

82. Do we need a place to put the locking cookie?
>We also need a hook for a version for the checkin/out/lock logic  
>DAN(?) proposed.  That was that when you
>lock or PUT a document, you specify the version so that a document  
>can be PUT or CHECKED IN by a different person to the one who GoT it.
>This means the server gives a key, a version or date code, with the  
>document. This is all HTTP2 except when a document is stored  
>somewhere, passed around and then eventually returned to the server.  
>In that case, it needs a place to hold its original version number
>on the server.
>
><EDITING NEXTID=z27 CHECKEDOUTAS="19930217234507">
>
>Thoughts?

Is
http://info.cern.ch/hypertext/WWW/Protocols/HTTP/HTTP2.html
the latest info on HTTP2? I don't see how the server gives the
cookie to the client on a CHECKOUT response. If the cookie goes
inside the HTML, then locking is limited to HTML format transfers.

I suspect that instead, the cookie will be part of the headers
of the response. Then it would be the client's responsibility
to keep the cookie associated with the document content. It
might be handy, though, for the client to be able to tuck the
cookie inside the HTML, if the document is indeed HTML.

>> http://info.cern.ch/hypertext/WWW/MarkUp/Elements/LINK.html
>> 20. How many of these are allowed? I could change
>Any non-negative integer
>> ... <!ELEMENT HEAD - -  (TITLE? & ISINDEX? & NEXTID? & LINK*)>
>> I don't know if the latter is legal SGML. I'd have to try
>> it out.
>I think that's what we want.

SGMLs says Okie Dokie. So this is resolved, pending a DTD update.


>> 21. Link types are not well defined. The only reason to put
>> something in a public specification is so everybody can agree
>> on some semantics. If there are no semantics to agree on,
>> why include the TYPE attribute? (Its status is at best
>> "proposed" in my mind, though it's in the DTD.)
>
>
>Yes and no.  We need some well-defined link type but we also need this  
>as a hook for the future which we haven't enough experience.  Link
>types should be registered.
>
>This is a flexibility point, but it must be firm ... like
>a towing ball on the back of your pickup you want to be able
>to connect anything onto it but you want it well fixed onto the  
>truck!
>
>But I want to make it REL instead of TYPE as people think TYPE
>refers to the object type of the destination object rather than the  
>link.  (From messages on this list).

Ok. Now we have _some_ semantics to agree on. I agree the list
should be extensible, but until there was some list to start with,
I didn't see sufficient motivation to put it in the spec.

On the REL/TYPE name issue: This brings up issue
73. Link types: we should look at HyTime before we go much further
on this.

The REL/TYPE attribute is very close to the HyTime "anchrole"
attribute. See below for more on issue 73.

In http://info.cern.ch/hypertext/WWW/MarkUp/Elements/A.html
83. "The type is expressed by a string for extensibility."
I think it's a NAME, rather than CDATA in the DTD. So it's
not an arbitrary string -- it has to start with a letter etc.

>> In http://info.cern.ch/hypertext/WWW/MarkUp/Headings.html
>> 22. "(at least six)" -- how about exactly six? Though I've
>> seen a lot of style guides that frown on anything more than 4.
>
>I agree.  I would frown on anything over 3 in a hypertext document.
>However, it is useful to generate a great big HTML document by
>concatenating little ones, demoting their heading levels. You then  
>print the big document. This generates up to 6 easily.  Maybe we  
>should go to 9 but frown on >4.

I agree with frowning on anything >4, but supporting up to 9 just
for this one application (concatenate, demote, print) seems like
a bad idea. I don't feel strongly one way or the other, though.

23: OK.

>> 24. In the Archive section, we could mention comp.text.sgml,
>> the SGMLs parser materials, and the ifi.uio.no archive.
>
>Link put in crudely.

Here's a reference to the SGMLs materials:
ftp://ifi.uio.no/pub/SGML/SGMLS/

>> In http://info.cern.ch/hypertext/WWW/MarkUp/Elements/A.html
>> 25. All attribute values have to be quoted, including NAME.
>> The example is wrong.
>
>I have changed NAME to be a NAME -- ie doc-wide unique which it must   
>be. Numeric ones are then not valid but I don't generate them any  
>more.  I think that we should stick to the intended ID system. In the  
>future, we can think about IDs on many other elements.

SGML considers omitting the quotes around an attribute value to
be a form of markup minimization. In the HTML DTD, I have turned
off all the minimization features, so ALL ATTRIBUTE VALUES MUST
BE QUOTED, regardless of whether they're CDATA, NAMEs, IDs, etc.

Now we have another issue:
84. Should anchor names be SGML IDs?
They're NMTOKENs in the current DTD. This restricts the syntax
to a string of 34 or fewer name characters. As for semantics,
the SGML parser will uppercase the string before handing it to
the application, but it won't check it for uniqueness.

Making this a NAME attribute would restrict the syntax further
so that it must begin with a letter.

Making this an ID attribute would give it the same syntax as NAME,
but the parser would also verify document-wide uniqueness.

The NMTOKEN approach was descriptive, whereas the ID approach
is somewhat prescriptive: it conflicts with some old HTML. I'm
all for moving to ID, though. I think we should take advantage
of SGML features wherever possible.
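
To make the three options concrete, here is a minimal Python sketch of
the checks each attribute type implies. It assumes the name characters
are letters, digits, "." and "-", and uses the 34-character limit
quoted above; the function names are invented for this example.

```python
import re

NAMELEN = 34  # assumption: the length limit quoted above for the current DTD

# Assumption: name characters are letters, digits, '.' and '-'.
_NMTOKEN = re.compile(r"[A-Za-z0-9.-]+\Z")
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9.-]*\Z")

def is_nmtoken(value):
    """NMTOKEN: one or more name characters, within the length limit."""
    return len(value) <= NAMELEN and _NMTOKEN.match(value) is not None

def is_name(value):
    """NAME: like NMTOKEN, but must begin with a letter."""
    return len(value) <= NAMELEN and _NAME.match(value) is not None

def duplicate_ids(values):
    """ID: NAME syntax plus uniqueness; compare after uppercasing."""
    seen, dups = set(), []
    for v in values:
        key = v.upper()
        if key in seen and key not in dups:
            dups.append(key)
        seen.add(key)
    return dups

print(is_nmtoken("27"), is_name("27"), is_name("z27"))
print(duplicate_ids(["z27", "Z27", "intro"]))
```

The second call shows why old documents with purely numeric anchor
names pass as NMTOKENs but would be rejected as NAMEs or IDs.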


26-28: OK (but issue 73 is still open)

29: OK

>> 30. Where are P's allowed? In the DTD, they're allowed in:
>> HTML, BODY, ADDRESS, BLOCKQUOTE, PRE but not in HEAD, A,
>> CODE, SAMP, etc.
>
>That's right.  They are not in the CERN implementations allowed in  
><DL> or <UL> etc, but they would be useful in those.
>Comments?

Can we do the following instead of using paragraphs in DL?
<DL>
<DT>large
<DT>big
<DD>Large and big are synonyms.
<DD>They share a definition.
<DD>This is the third paragraph of explanation about big
and large things.
</DL>

It appears that the linemode browser generally groks. Thus, let's
not mess with P's in DL's. (P's in UL's are messy enough to leave
out too, I guess.)

31-32: OK

>> 33. What does this mean?
>> The opening list tag  must be immediately
>> followed by the first list element.
>
>(LI | (A|%text)+)  in SGML I suppose just as you say.
>You can't
>	<UL>and here they all are:
>	<LI>The first..
>	<LI>the second
>	</UL>

I kinda thought LI's in UL's were just like P's in the BODY:
they separate, rather than terminate or begin, items.

But if you prefer/require the added structure, I'm all for it.

>> 34. The important difference between UL, MENU, and DIR is not
>> how they are displayed, but their semantic meanings. A MENU
>> is a list of things to choose from. A DIR is a list of names
>> in a directory.
>
>Yes and no.  I too like logical definitions -- I am sold on semantic  
>markup but HTML is to cover a vast range of data and semantics. MENU
>These things are NOT necessarily what their names suggest -- many a  
>selectable menu is set out as a DIR or a DL. The element names are
>mnemonic only.  The blurb talks about how much text is in the  
>paragraphs.

What the blurb talks about is exactly what I object to: defining
elements by their formatting. If they have no semantic difference, why
not just have one list element with an advisory WIDTH or COLUMNS
attribute?

In fact, I'd support more structure in these things. I suggest
that a MENU should be exactly a list of anchors -- same with DIR,
while UL is a list of free form text, possibly including anchors.

And the difference between MENU and DIR should not be defined in
terms of a number of characters. I'd explain it by saying that the
items in a DIR are likely to be short, mnemonic names that require
significant context for the user to choose between them, whereas MENU
items should be more verbose and may not require explanation or
context.

I see that you've edited the List document somewhat. It's getting
better, but there are some bugs too. Some sections that look like
they should be PRE are not (see #80).

In http://info.cern.ch/hypertext/WWW/MarkUp/Lists.html

85. The suggested syntax for COMPACT uses minimization, which has
been turned off in the HTML SGML declaration.

The markup
<DL COMPACT>
could be used if we had SHORTTAG YES (or maybe this is OMITTAG),
but we don't. It's short for
<DL someattributename="COMPACT">
Some legal alternatives are:
<DL STYLE="COMPACT">
<DL COMPACT="yes">

35-37: OK

>> 38. Semantics of newlines in PRE. Given the current DTD, a newline
>> after the PRE start tag or before the PRE end tag is not reported
>> by an SGML parser.
>> 
>
>> I think I can cook up some magic SHORTREF declarations that will
>> cause the SGML parser to report the newlines (possibly as P tags).
>> [This would obviate the need for special newline processing code
>> in libHTML]
>> 
>
>> In any case, I'd suggest that ALL NEWLINES REPORTED BY THE SGML
>> PARSER IN THE PRE ELEMENT BE DISPLAYED AS LINE BREAKS. That only
>> leaves the issue of which newlines are reported, which is governed
>> by the SGML standard.
>
>... and with the issue of explaining the end result to the
>simple HTML writer and to me without our needing to call on the
>model of the SGML engine and application. Awaiting the results
>of your tests with SHORTREF.

Geez... you're really dragging your feet whenever it comes to
embracing SGML, aren't you? I think the number 1 issue is conformance,
with ease of documentation secondary. If we come up with a conforming
solution, folks can look at our documentation, SGML books, post to
comp.text.sgml, etc. for clarification.

On the other hand, if we cook up some shortcut (like "Line boundaries
within the text are
rendered as a move to the beginning
of the next line, except for one
immediately following or immediately
preceding a tag" which is not exactly what SGML reports, believe me),
a) folks using SGML compliant software will have to jump through
hoops to support our stuff, and b) our documentation becomes the
only place to look for guidance.

Back to the technical issues...

Newlines are also an issue in A elements. My version of the MidasWWW
browser sometimes displays an extra space in the case of:

some text with an <a
HREF="foo">
anchor</a> in it.

It displays one for the space after "an", and one for the newline,
since it doesn't remember whether it's in a word or not across
elements.

SGMLs would eat the newline after the A start tag. If I changed
libHTML to be in conformance on this issue, that would fix the
MidasWWW bug. I'm beginning to think that the way SGML handles
newlines by default makes some sense. We do need a cogent
explanation, though.
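
For comparison, the "remember whether it's in a word across elements"
fix can be sketched as below. Note that this is a different technique
from SGML's own newline rules: it simply collapses whitespace runs
while carrying state across chunk boundaries. The function name and the
representation of the document as a list of text chunks are assumptions
made for this example.

```python
def collapse_whitespace(chunks):
    """Join text chunks, collapsing whitespace runs even across chunk boundaries."""
    out = []
    in_space = True  # start in "space" state so leading whitespace is dropped
    for chunk in chunks:
        for ch in chunk:
            if ch.isspace():
                if not in_space:
                    out.append(" ")
                in_space = True
            else:
                out.append(ch)
                in_space = False
    return "".join(out).rstrip()

# Text before the A start tag, inside the A element, and after the end tag.
chunks = ["some text with an ", "\nanchor", " in it.\n"]
print(collapse_whitespace(chunks))
```

The state variable survives from one chunk to the next, so the newline
after the start tag does not produce a second space.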

39: OK

>> 40. "... character character highlighing elements may be used."
>> Ack! I don't recommend this! Hmmm... maybe only the B, I, and U
>> elements. This certainly conflicts with the current DTD.
>
>Serious point here folks.  There was a great demand for B I U
>for man pages and the like. Why prohibit anything other than TT.
>or to keep it simple, allow anything and mention TT should not be  
>used, and the constraints of fixed width may limit the ability to  
>render some highlighting.

Highlighting elements like CODE, SAMP, etc. do not mandate a
particular formatting style. One shop may put SAMP sections
in a fixed width font, where another may just use a sans-serif
font to show the distinction. Some shops may not make any
typographic distinction at all.

The PRE element is a complete escape from all this. It's an
element that says "here's what it looks like. The semantic
information has already been rendered into formatted text."
So while B, I, and U still have a place, CODE, SAMP, etc.
do not belong.


>I have introduced %htext noting that text always occurred with A.
>I hope I have done it right.

If you had built sgmls, you could know for sure :-) Yes, it
looks right to me, though I haven't done the translation
from hypertext to SGML code to check.

41-43: OK

>> 44. Examples (TBD) see complete.html in my stuff.
>
>I repeat that I like your examples but I would like them split
>into GOOD HTML documents describing bad HTML documents,
>with links to the bad documents for testing only.
>We don't want people to follow links to the only documentation to  
>find their parser has core dumped :-)

complete.html is GOOD HTML. But it lacks explanation.  I only meant to
suggest that the author of the Highlighting examples consult
complete.html, not that readers should be referred there.

45: OK

>> 46. "The text may contain any ISO Latin printable characters" --

OK, pending DTD update.

47-48: OK

>> In http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
>> 49. "Special characters are represented
>> by SGML entities"
>> They're represented by numeric character references.
>> The lt, gt, and amp entities are not in the DTD. They should
>> be supported for historical reasons, but they are obsolete.
>I would like them in the DTD. While people are still reading/writing
>HTML they are useful. My mental ASCII table is in hex, not decimal,  
>anyway.  Are they any overhead? Why the war against them? For the ISO
>characters you wanted the opposite.  (Does your mental ASCII table  
>stop at 128? Mine too)
>
>Comments?

For the ISO characters, I wanted to use names because I wanted to
treat them as external (i.e. non-SGML text) data entities. Since <, >,
and & have representations without resorting to entity processing,
that's the way I wanted to go. We would have had numeric character
references, handled by the parser, and external data entities, handled
by the application, but no text entities (which are handled by the
parser). It simplified the parser and the parser/application interface.

But since we're going to include ISO characters in the HTML
character set, and I still want to be able to use their names,
we'll need to support text entities. As long as we're supporting
text entities, we might as well keep &lt;, &gt;, and &amp;.

>> 50. I'd like to move...

Whoa... no response at all? This was one of the major issues.
What does anybody else think of reorganizing this stuff?

51-57: OK, pending DTD update.

>> 58. Get rid of NEXTID element?
> Nope .. needed to stop editors reusing deleted IDs. See above.

I still say this doesn't belong in the standard. You can continue
to use it locally, Tim, but I don't think it's a generally
sound mechanism.

>> 59. Document URN, TITLE, METHODS attributes of A element.
>Ooo yes. Done. Lots of "notes" attached for info only.

Almost done. METHODS is not a comma separated list. The SGML
attribute type NAMES is one or more names separated by whitespace.
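
A minimal Python sketch of reading such a value follows. The method
names in the example are placeholders, not a list taken from the spec,
and the uppercasing assumes the parser normalizes names as described
for anchor names above.

```python
def parse_names(value):
    """Split a NAMES attribute value on whitespace and uppercase each name."""
    return [name.upper() for name in value.split()]

# Placeholder method names, for illustration only.
print(parse_names("get  put\tcheckout"))
```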

>> 60. Proposed Headers element (like a DL; for RFC822 message  
>let's put COMPACT as an attribute for DL and leave the HEADERs if you  
>don't mind too much.

Okie dokie.

>> 61. List STYLE attribute?
>No I don't think so -- see discussion #60

>> 62. XMP and LISTING: CDATA or RCDATA?
>CDATA is probably nearest to the original intention?

See 17. If we're going to have CDATA anywhere, it belongs
here. But I'd rather avoid it.

63-64: these are mine.

>> In http://info.cern.ch/hypertext/WWW/Test/test.html
>> 65. This node should be moved to the implementors' guide.
>Same comments as above -- moved in <PRE>

I don't follow you at all here. I think this relates closely
to #50.

66-70: OK

71: OK.

>> 72. Comments: the comment element is a bad idea. SGML comments are
>> documented and supported.
>
>They are rather different in that a comment can surround a whole
>nested stack of SGML elements, and could be nested. I don't suppose  
>SGML comments can?

No, SGML comments can't be nested, but they could "surround a whole
nested stack of SGML elements." I'm prepared to debate the comment
element further, but shall we suffice it to say that it's obsolete?
It's listed as Deprecated in the web. This means it should be
supported by future implementations. I heartily disagree.

>> 73. Link types: we should look at HyTime before we go much further
>> on this.
>Well, there are only 9 pages on hypertext in HyTime (More Time than  
>Hy) and in that I can't see any mention of link types.  As I said  
>above (with a different metaphor), I think this should be a well  
>defined and entrenched gate into uncharted territory

The HyTime ilink architectural form has an attribute called "anchrole"
that has almost exactly the same semantics as the HTML anchor TYPE/REL
attribute, except that the HyTime anchrole has one name for each end
of the link, rather than a single name for the relationship between
the two ends. This obviates the need for the REV attribute: just
list the roles in the opposite order.

I'd like to get some HyTime folks to review the WWW data model. I
think we could get Eliot Kimber and/or Steve Newcomb to take a look
by posting to comp.text.sgml. I think that's just what I'll do.


74-77:
>> In the midaswww-1.0 browser: [by the way: I've fixed all these in  
>my copy]
>Could you post diffs please for those Dan? Thanks.

Tony and I have to sync up on all this. He's the author, and I
won't release stuff without his OK.

78: Not an HTML issue.

86. SAVEDAS
> Now, what about the SAVEDAS address so that from just the content of  
>the document the partial UDIs can be resolved? I think that is a  
>useful thing, and could be essential. I will put that in as Standard.

Yuk, but OK for now. (I have lots to say about URLs, but until I
have time to do something about it, I'll leave them be :-)

Dan