Re: New DTD (final version?)

Daniel W. Connolly (connolly@hal.com)
Wed, 8 Feb 95 02:16:08 EST

In message <9501301101.AA07367@texcel.no.texcel.no>, Paul Grosso writes:
>> From: "Daniel W. Connolly" <connolly@hal.com>
>>
>> Would some of the SGML experts out there look over the SGML
>> declaration? I think the capacities/quantities need some
>> tweaking.

>As far as the rest, my first comment is on the newly introduced line
>breaks in the public identifiers. [...] All of this to say that, if one
>wishes to introduce a line break in a public identifier, it *must* be
>done at an existing space to avoid changing the (normalized) minimum
>literal.

OK. Fixed. See:

http://www.hal.com/~connolly/html-spec/html.decl
$Id: html.decl,v 1.12 1995/02/08 06:14:01 connolly Exp $
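
To illustrate the line-break point, here is a minimal Python sketch. It
assumes that normalizing a minimum literal amounts to collapsing each run
of spaces and line breaks into a single space; the function name is
hypothetical and this is not a full ISO 8879 implementation.

```python
import re


def normalize_minimum_literal(literal: str) -> str:
    """Collapse runs of spaces and line breaks into one space (assumed rule)."""
    return re.sub(r"[ \r\n]+", " ", literal).strip(" ")


original = "-//IETF//DTD HTML//EN"
broken_at_space = "-//IETF//DTD\nHTML//EN"    # line break replaces an existing space
broken_mid_word = "-//IETF//DTD HT\nML//EN"   # line break inserted inside a word

# A break at an existing space leaves the normalized literal unchanged.
same = normalize_minimum_literal(broken_at_space) == original
# A break anywhere else introduces a space that was not there before.
changed = normalize_minimum_literal(broken_mid_word) != original
```

Under that assumption, only the first way of breaking the line identifies
the same public text.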

>As far as the quantities, I've written a long message about that already.
>In summary, here are my suggestions, though other values may make as much
>sense (read my Jan4 message appended below):
>
> QUANTITY SGMLREF
> ATTSPLEN 2100
> LITLEN 1024
> NAMELEN 72 -- somewhat arbitrary; taken from
> internet line length conventions --
> PILEN 1024
> TAGLEN 2100

OK. Fixed.

>> Anybody who really knows about character set declarations is
>> invited to look those over too. I'm still not clear on the distinction
>> between NONSGML, UNUSED, SHUNCHAR, etc.
>
>I'm afraid I am not a character set expert. I see they haven't changed
>since the IETF draft when I last looked at them, so I doubt I'll have
>much more input. Hopefully some other SGML experts with more expertise
>in character sets will check it out (I'll ask around a bit).

This character set stuff is unbelievably obscure. I think there are
about three people on the planet who understand it (Charles Goldfarb,
Erik Naggum, and James Clark), and I don't think their understandings
agree with each other.

In message <95Jan30.082249pst.2760@golden.parc.xerox.com>, Larry Masinter writes:
>
>My main concern at this point is that the SGML declaration for
>character sets in the HTML standard not get in the way of using
>an external "charset" declaration to actually set up what the
>character encoding of the document might be.

The thing to remember is that in defining HTML as an Internet Media
Type (aka MIME type) and an SGML application, we have a certain amount
of leeway in specifying how you take a MIME body part (i.e. content
type and sequence of octets) off the wire/disk and turn it into an
SGML entity (sequence of characters) for parsing.

Here's my model of how you do it:

1. Consult the charset parameter (which defaults to
US-ASCII, or perhaps ISO-8859-1 according to the
current HTTP spec). This gives you (1) an octets-to-characters
mapping, and hence (2) a character repertoire.

2. Construct an SGML declaration by taking the html.decl
public text and substituting a document character set
corresponding to the charset parameter above. (We have
to document the correspondence somewhere, but for
starters, US-ASCII maps to:
"ISO 646:1983//CHARSET
International Reference Version
(IRV)//ESC 2/5 4/0"
(and an appropriate DESCSET)
and ISO-8859-1 maps to ISO-646 plus:
"ISO Registration Number 100//CHARSET
ECMA-94 Right Part of
Latin Alphabet Nr. 1//ESC 2/13 4/1")

Anyone who knows enough about SGML character sets
is welcome to step forward and give counterparts
to Unicode-UCS-2, Unicode-UTF-8, Unicode-UTF08, ISO-2022-JP,
and other MIME charsets.

3. Convert the octets to characters via the specified charset.
For each line of text (delimited by CRLF, that is octets 13,10),
put an SGML record start character (as per the SGML
declaration from step 2) at the beginning, and an
SGML record end character at the end.

(Note that this might mean that UCS-2 and other encodings
where octets 10 and 13 can be used in the encoding of characters
besides newlines cannot be used.)

4. If the character sequence from step 3 does not begin with a
doctype declaration (e.g. <!doctype html ...>), construct
a prologue by consulting the level= and version= parameters.
For example, in most cases, we infer the prologue:

<!doctype html public "-//IETF//DTD HTML//EN">

If the character sequence has its own prologue, use that.

(Strictly speaking, you should skip past any whitespace
and SGML comments to find the doctype declaration. But
an expedient decision procedure would be to test the
first three characters to see if they are "<!D" or "<!d")

5. Now you're done. You have a complete SGML document entity:
an SGML declaration, a prologue, and an instance. Parse
as per ISO-8879.
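
The five steps above can be sketched in Python roughly as follows. This
is only an illustration under stated assumptions: the function and
constant names are hypothetical, RS and RE are taken to be characters 10
and 13, the charset table covers just the two charsets named in step 2,
and the level= and version= parameters are ignored, so the inferred
prologue is always the one shown in step 4.

```python
DEFAULT_CHARSET = "US-ASCII"   # step 1 default
DEFAULT_PROLOGUE = '<!doctype html public "-//IETF//DTD HTML//EN">'
RS, RE = "\n", "\r"            # assumed: record start = 10, record end = 13

ISO_646 = "ISO 646:1983//CHARSET International Reference Version (IRV)//ESC 2/5 4/0"
LATIN_1 = ("ISO Registration Number 100//CHARSET "
           "ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1")

# Step 2: charset parameter -> base character sets for the SGML declaration.
BASE_CHARSETS = {
    "US-ASCII": [ISO_646],
    "ISO-8859-1": [ISO_646, LATIN_1],
}


def to_records(octets, charset=DEFAULT_CHARSET):
    """Step 3: decode, then wrap each CRLF-delimited line in RS ... RE."""
    lines = octets.decode(charset).split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()            # a trailing CRLF ends the last line
    return "".join(RS + line + RE for line in lines)


def has_doctype(entity):
    """Step 4, expedient test: skip blanks, then look for "<!d" or "<!D"."""
    return entity.lstrip()[:3] in ("<!d", "<!D")


def build_document(octets, charset=None):
    """Return (base character sets, prologue plus instance) for the parser."""
    charset = (charset or DEFAULT_CHARSET).upper()
    base_sets = BASE_CHARSETS[charset]   # KeyError: correspondence not documented
    entity = to_records(octets, charset)
    if not has_doctype(entity):
        entity = RS + DEFAULT_PROLOGUE + RE + entity
    return base_sets, entity


base_sets, document = build_document(b"<title>x</title>\r\n<p>hi\r\n")
```

The caveat in step 3 is easy to reproduce: in a big-endian two-octet
encoding, the single character U+0D0A is exactly octets 13, 10, so
"\u0d0a".encode("utf-16-be") gives b"\r\n" and an octet-level CRLF split
would cut that character in half.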

Daniel W. Connolly "We believe in the interconnectedness of all things"
Software Engineer, Hal Software Systems, OLIAS project (512) 834-9962 x5010
<connolly@hal.com> http://www.hal.com/%7Econnolly