Re: Last call: Intro, SGML, MIME sections

Dan Connolly (connolly@w3.org)
Thu, 4 May 95 12:44:15 EDT

lilley@afs.mcc.ac.uk writes:
>
> > Hence the terminals above parse as:
>
> > HTML
> >  |
> >  \-HEAD, BODY
> >    |       |
> >    \-TITLE \-P
> >      |       |
> >      |       \-<P>,"Some text. ",EM
> >      |                           |
> >      |                           \-<EM>,"*wow*",</EM>
> >      \-<TITLE>,"Parsing Example",</TITLE>
>
> Given certain historical problems with P, I would be happier if the
> first occurrence of P in an example in the standard showed a closing P tag
> somewhere.

I think you would be in the minority. </P> tends to send many folks
into a fit of rage, whereas the lack of </P> doesn't bother the folks
who grok P as a container. This is completely arbitrary, as far as I'm
concerned. Any other opinions?

> Either in the example document or in the parse tree.

The P element contains <P>, "Some text. " and an EM element. So the
fact that P is a container is represented. Having elements "contain"
their tags is a little non-traditional, but it's consistent with the
way derivations and parse trees are represented in computational
linguistics literature and I like it :-)

> Yes, I am aware that the closing </p> can be omitted. But as the parse
> tree shows HEAD and BODY being inferred, could it not show </P>, </BODY>
> and </HTML> being inferred as well? Just to make the point early on?

That's not the way I see it, nor the way I was trying to represent it.
It doesn't show tags being "inferred" so much as elements that sometimes
have tags and sometimes don't.

If you have a grammar:

a -> b y? c
b -> x
c -> z

and you parse the string 'xz', then you'd show the parse tree as:

a
|
\-b
| |
| \-x
|
\-c
  |
  \-z

wouldn't you? Or would you put the y in there, even though it's not
in the original string? That would be a strange version of a parse
tree.
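
To make the point concrete, here is a minimal sketch of a hand-written
parser for that toy grammar. The function name parse_a and the tuple
representation of the tree are just assumptions for illustration; the
only thing taken from the grammar itself is that y is optional. A node
for y shows up only when a y is actually in the input.

```python
def parse_a(s):
    """Parse s with the toy grammar:  a -> b y? c ;  b -> x ;  c -> z.

    Returns a tree as (label, children); terminals are plain strings.
    """
    pos = 0
    children = []

    # b -> x
    if not s.startswith("x", pos):
        raise ValueError("expected 'x' at position %d" % pos)
    children.append(("b", ["x"]))
    pos += 1

    # y? : the optional terminal gets a node only if it is really there
    if s.startswith("y", pos):
        children.append("y")
        pos += 1

    # c -> z
    if not s.startswith("z", pos):
        raise ValueError("expected 'z' at position %d" % pos)
    children.append(("c", ["z"]))
    pos += 1

    if pos != len(s):
        raise ValueError("trailing input at position %d" % pos)
    return ("a", children)


tree_without_y = parse_a("xz")
tree_with_y = parse_a("xyz")
```

Parsing 'xz' gives a tree with just the b and c subtrees; parsing 'xyz'
gives the same tree with a y leaf between them. Nothing is "inferred"
in the first case.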

> > The syntax character set for all HTML documents is ISO-646-IRV.
>
> The word syntax is not a link, so the term 'syntax character set' does
> not seem to be defined. It would aid clarity if it were.

OK. Done. Ack! The real term is "syntax-reference character set".

> > Note that the terminating semicolon is only necessary when the character
> > following the reference would otherwise be recognized as markup:
>
> True, but perhaps this should say a little more strongly that the
> trailing ; is not actually wrong, just that it can be omitted in this
> instance if you really want.

I'm gonna leave this alone unless you (1) motivate your case more
strongly, or (2) provide replacement text. (I prefer (2)).

> > Version
> > To help avoid future compatibility problems, the version parameter may
> > be used to give the version number of the specification to which the
> > document conforms.
>
> If omitted, what does it default to? The current highest version number that
> has been standardised?

Hmmm... if there's no "version=.." parameter, then the guy hasn't
given you anything that will help avoid compatibility problems. I'm
not really sure what this thing is for. It's kind of a placeholder, in
case we can use it in the future to work around some problem. It could
be scrapped altogether, as far as I'm concerned.

The "Toward Graceful Deployment of Tables" document suggested level=3
as the tables format negotiation key. That's been implemented (in
Arena and the Apache server) and it works.

In general, how do we negotiate media types with parameters? If we
want a parameter that has common semantics across media types, I'd
prefer version= to level=. Has the HTTP working group hashed this out?

> > Charset
>
> Again it would be helpful to say explicitly what happens when this is omitted

Good catch. This was in an earlier draft, and almost got lost between
the cracks:

The default value is outside the scope of this specification;
but for example, the default is US-ASCII in the context of
MIME mail, and ISO-8859-1 in the context of HTTP.
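
A rough sketch of how a receiver might apply that rule, using Python's
standard email.message module to pick the charset parameter out of a
Content-Type value. The function name effective_charset, the context
labels "mail" and "http", and the DEFAULT_CHARSET table are all made up
for this example; only the two default values come from the text above.

```python
from email.message import Message

# Defaults by transport context, per the text above (hypothetical table).
DEFAULT_CHARSET = {
    "mail": "us-ascii",    # MIME mail
    "http": "iso-8859-1",  # HTTP
}


def effective_charset(content_type, context):
    """Return the declared charset, or the context's default if omitted."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when there is no charset parameter.
    declared = msg.get_content_charset()
    if declared is not None:
        return declared
    return DEFAULT_CHARSET[context]


no_charset_http = effective_charset("text/html; version=2.0", "http")
no_charset_mail = effective_charset("text/html; version=2.0", "mail")
explicit = effective_charset("text/html; charset=ISO-8859-1", "mail")
```

An explicit charset parameter always wins; the context only matters
when the parameter is missing.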

> > HTML user agents must support the ISO-8859-1 character encoding scheme,
> > and hence the US-ASCII character encoding scheme. (9)
>
> I feel you should either use ASCII or ISO-646-IRV throughout (and have a
> footnote explaining the relationship between the two).

US-ASCII is a character encoding scheme, as per MIME. ISO-646-IRV is a
coded character set, as per SGML and the rest of ISO. Unfortunately,
ISO-8859-1 serves as both, so it's confusing. I've done my best to
be accurate, if not easy to understand. Care to take a stab at
a clarification?
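
In the meantime, a small sketch of the distinction, using Python's
built-in codecs (the helper names below are invented for illustration).
A coded character set assigns numbers to characters; an encoding scheme
turns those numbers into octets. For ISO-8859-1 the two coincide, one
octet per character with the same value, and the US-ASCII octets mean
the same thing under both schemes, which is the "and hence" in the
quoted requirement.

```python
def code_positions(text):
    """Coded character set view: the number assigned to each character."""
    return [ord(ch) for ch in text]


def latin1_octets(text):
    """Encoding scheme view: the octets ISO-8859-1 produces for text."""
    return list(text.encode("iso-8859-1"))


def same_under_both(data):
    """True if data decodes identically as US-ASCII and ISO-8859-1."""
    return data.decode("ascii") == data.decode("iso-8859-1")


sample = "caf\u00e9"  # 'e' with acute accent is code position 233
positions = code_positions(sample)
octets = latin1_octets(sample)
all_ascii_agree = same_under_both(bytes(range(128)))
```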

> Lastly, I should say that my overall feeling from reading the spec is that it
> seems rather stiff and impenetrable, even when you understand what it is talking
> about. I would compare the language used to an ISO standard, which you may take
> as a compliment or not according to preference ;-)

I tend toward minimalism. I value conciseness, precision, and
rigor. And frankly, I don't have a lot of time for easy-reading
stuff. But I'll work on it as time permits, and I'm open to
submissions.

Thanks for the prompt and thorough response.

Dan

p.s. anything I didn't respond to was incorporated.