Re: comments on the DTD in Nov 16 draft

Daniel W. Connolly (connolly@hal.com)
Tue, 22 Nov 94 09:03:21 EST

In message <9411211757.AA25193@texcel.no.texcel.no>, Paul Grosso writes:
>section 3.4.3, page 15:
>- very first line, I don't understand the need for this use of &quot;
> [and I would really dislike the use of &#34; as done in the DTD] when
> one can use LITA (that is, a single quote) as the delimiter. While
> I'm not opposed to pointing out the possible use of &quot; (though
> I would argue against recommending &#34;)

Why would you argue against recommending &#34;? Just curious.

>, I'd like to see the option
> of using single quotes pointed out.

I think I gave a replacement blurb that did just that in an earlier
message.

> NOTE: Unless you use the minimized syntax, some implementations
> won't understand.
> This doesn't make clear to me which form is "minimized." Assuming
> strict SGML terminology (where 'minimized' is the opposite of 'minimal'),

Wheee! Isn't SGML fun?

> And, if this is the case, it's too bad, because some SGML tools
> only understand (and even more--the great majority, in fact--only produce)
> minimal aka normalized form.

Yes, it is too bad. I expect that these bugs will be fixed in
upcoming releases of browsers. But that doesn't mean they're not
there. These 'NOTE:' thingies are just warnings to information
providers.

>section 3.4.4, page 15:
>- first paragraph says "HTML generators should generate strictly conforming
> HTML." How/where is this defined--I mean above and beyond the DTD? In
> particular, what about my previous point? Is an SGML editor that produces
> normalized SGML (and therefore does NOT produce minimized syntax for
> attributes such as UL's COMPACT) generating strictly conforming HTML?

Yes. There's a blurb somewhere that says that HTML is an application
of SGML, and that in the case of an apparent conflict, SGML is definitive.

>section 3.12.2, page 30:
>- in the item about line boundaries, if the statement is meant to match
> what SGML defines, I would recommend amending:
> * Line boundaries within the text are rendered as a move to
> the beginning of the next line, except for one immediately
> following or immediately preceding a tag.
> to
> * Line boundaries within the text are rendered as a move to
> the beginning of the next line, except for one immediately
> following a start tag or immediately preceding an end tag.
> [record end handling in SGML is potentially more complex, but my
> suggested modification should make things practically (and I mean
> that in both senses) accurate.]

Bzzt. The HTML spec punts on all the phase-of-the-moon record end
handling:

From: the SGML declaration of HTML, aka
http://www.hal.com/users/connolly/html-spec/html.decl

FUNCTION
-- SPACE 32
TAB SEPCHAR 9
LF SEPCHAR 10
FF SEPCHAR 12
CR SEPCHAR 13 --

-- The above is an accurate description of the usage of FUNCTION --
-- characters in HTML implementations; that is, there is no --
-- Record Start or Record End character, and no occurrences of --
-- character 10 or 13 are "ignored" by the parser. --
-- But because few SGML implementations support this concrete --
-- syntax, we include the one below. --

-- Note that in order to get correct behaviour w.r.t. newline --
-- processing, you will have to play some tricks in constructing --
-- the document entity for parsing in order to keep the parser --
-- from ignoring newlines in surprising ways --

RE 13
RS 10
SPACE 32
TAB SEPCHAR 9

>section 6.2.1, page 61:
>- This whole section scares me a bit. As I wrote elsewhere, I'd rather
> just reference the ISO set. If we want to publish the byte numbers
> in the HTML spec that may be used by some browsers, we can do that,
> but that's just a question of display-tool-dependent encoding of
> the standard ISO character entities. And other tools may not need
> or want to use (or be able to use) that particular encoding.
> The whole point is that authoring/editing tools should write HTML
> documents--and browsers should read/process HTML documents--using
> the character entity references defined in the ISO character set
> such as &Aacute;. I don't see it as part of the definition of HTML
> to tell tools what potentially device-dependent replacement text they
> must use.

The point is that it's _not_ device dependent. Those are not "the byte
numbers used by some browsers." The document character set of HTML is
ISO8859-1, ISO Latin 1. The &szlig; markup refers to exactly that
character in ISO8859-1, not to any device dependent thingy.

Case in point: More than once, I have seen HTML-to-TeX converters
that convert &ouml; to \ouml or some such TeX markup, but the implementor
neglected to convert &#246; (or the single character at position 246)
to \ouml. This is wrong. The correct way to do this sort of conversion
is to let the entity manager and parser reduce all three representations:
* &ouml;
* &#246;
* ö (the single character 246)

to the last one. Then, write code to convert an arbitrary sequence
of ISO8859-1 characters to TeX markup.
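
To make that concrete, here is a minimal sketch in Python. It is only an
illustration, not anybody's actual converter: html.unescape from the
standard library stands in for the SGML parser and entity manager, and
the LATIN1_TO_TEX table and the function names are hypothetical and
cover just three characters.

```python
import html

# Hypothetical, deliberately tiny table from ISO8859-1 characters to
# TeX markup. A real converter would cover the whole upper half.
LATIN1_TO_TEX = {
    "\u00f6": r'\"{o}',  # o with diaeresis, position 246
    "\u00df": r"{\ss}",  # sharp s, position 223
    "\u00c1": r"\'{A}",  # A with acute, position 193
}


def normalize(text):
    """Reduce &ouml;, &#246; and the raw character to one representation.

    Assumption: html.unescape plays the role of the parser/entity manager.
    """
    return html.unescape(text)


def latin1_to_tex(text):
    """Convert a sequence of already-normalized characters to TeX markup."""
    return "".join(LATIN1_TO_TEX.get(ch, ch) for ch in text)


def html_text_to_tex(text):
    """Normalize first, then convert; the converter sees one form only."""
    return latin1_to_tex(normalize(text))
```

Because normalization happens first, the conversion table only has to
know about characters, never about entity names or numeric references.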

The same goes for a display system like a PC or Mac, which, unlike X,
doesn't use the ISO8859-1 character set (encoding, if you like) much:
you reduce the &ouml; and &#246; markup to a single character in the
parser/entity manager, and then you map ISO8859-1 to the local
encoding.
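
A rough sketch of that two-step approach in Python, under some
assumptions: html.unescape stands in for the parser/entity manager, the
function name to_local is made up for this example, and Python's
"mac_roman" and "cp437" codecs stand in for a Mac's and a PC's local
encoding.

```python
import html


def to_local(markup, local_encoding):
    """Reduce entity and numeric references to characters, then map to
    the display system's own encoding.

    Assumption: html.unescape plays the role of the parser/entity
    manager. Characters the local encoding lacks become "?".
    """
    chars = html.unescape(markup)
    return chars.encode(local_encoding, errors="replace")


# The same character lands on a different byte in each local encoding.
for name in ("latin-1", "mac_roman", "cp437"):
    print(name, to_local("&ouml;", name), to_local("&#246;", name))
```

The markup never mentions any of those byte values; only the last step
knows which encoding the display happens to use.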

>- If we are going to create our own entity set, we cannot include the
> ISO copyright. I recommend we also do not use %ISOlat1; as the
> example entity name.

This part I'm still not clear on. Suggestions are welcome.

Dan