A bit of history: the current HTML specification effort started in May
'94, just after the Geneva conference. NCSA Mosaic and the linemode
browser were already widely deployed. We've been playing catch-up for
a long time.
The folks creating HTML documents, by and large, aren't reading the
HTML spec. They're just previewing in their browser.
I think it would be more accurate to say "If the HTML implementations
had been stricter, there would have been far fewer illegal documents."
Check the www-talk archive: I was the first one to compare the
original HTML spec to the SGML spec, cuz I was in the same position
you're in: trying to implement from that spec. That's how this whole
thing got started:
http://gummo.stanford.edu/html/hypermail/.www-talk-1992.messages/122.html
|Date: Thu, 25 Jun 92 16:59:59 CDT
|From: Dan Connolly <connolly@pixel.convex.com>
|
|Beware, for example, that an
|SGML parser will expand entity references in an attribute literal
|to produce the CDATA for the attribute value. So that
|<A HREF="A&P"> might be OK for the linemode browser,
|but an SGML parser will try to resolve &P.
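To make the point of that old message concrete, here is a minimal sketch in Python (my choice of language for illustration). The entity table and the function name are made up for the example, and a real parser takes its entities from the DTD. It only shows why "A&P" in an attribute literal trips a parser that expands entity references, while "A&amp;P" does not.

```python
import re

# Hypothetical, deliberately tiny entity table; a real DTD declares many more.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"'}

# "&" followed by a name start character opens a general entity reference.
# The trailing ";" is optional here to mimic the SGML short form.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")


def expand_attribute_literal(literal):
    """Expand entity references in an attribute literal.

    Raises ValueError for a reference to an undeclared entity, which is
    roughly what a strict parser would report for HREF="A&P".
    """
    def replace(match):
        name = match.group(1)
        if name not in ENTITIES:
            raise ValueError(f"undeclared entity: &{name}")
        return ENTITIES[name]

    return ENTITY_REF.sub(replace, literal)


if __name__ == "__main__":
    print(expand_attribute_literal("A&amp;P"))
    try:
        expand_attribute_literal("A&P")
    except ValueError as err:
        print("error:", err)
```

A lenient browser would simply pass "A&P" through untouched, which is exactly the mismatch the message warns about.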
But this is all crying over spilled milk. It's much more useful to say
"If we deploy validation tools with nice user interfaces, and if the
various conversion tools are refined to observe the HTML DTD, there
will be far fewer illegal documents in the future."
> I know this
> places a burden on both the parser implementor and the author of HTML
> text, but right now I am faced with "fixing" my SGML parser so that it
> implements all the peculiarities of HTML.
Amen, brother!
> To give an example, http://www.hpl.hp.co.uk/people/dsr/html3/HTMLandSGML.html,
> specifies how to resolve inconsistencies between HTML and SGML. THIS
> IS WRONG!
Agreed. This was recently fixed in the HTML 2.0 spec. It will get
fixed in the 3.x documents as soon as Dave gets around to it.
We have some administrative stuff to work out, though.
> Take for example the fact that a lot of implementations allow any
> character that is not a space or '>' in unquoted attribute values. As
> a result everybody specifies URLs without quoting them. But SGML
> clearly specifies that you are only allowed to use name characters!
So are you going to be the first software vendor to cease to support
this non-standard idiom? If so, I support you. Your users may
not. :-{
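For anyone who wants to measure how much of their content relies on that idiom, a rough check is easy to sketch. This Python fragment assumes the name characters are letters, digits, "." and "-", as in the HTML 2.0 SGML declaration. The function name is hypothetical.

```python
import re

# Assumption: name characters are letters, digits, "." and "-",
# as in the HTML 2.0 SGML declaration.
NAME_CHARS = re.compile(r"[A-Za-z0-9.-]+")


def needs_quotes(value):
    """Return True if an attribute value may not legally appear unquoted."""
    return NAME_CHARS.fullmatch(value) is None


if __name__ == "__main__":
    for value in ["index.html", "http://info.cern.ch/", "../pics/logo.gif"]:
        verdict = "must be quoted" if needs_quotes(value) else "ok unquoted"
        print(value, "->", verdict)
```

Almost any URL with a scheme or a path contains ":" or "/", so it fails the check, which is why the unquoted-URL habit conflicts with SGML so often.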
> Another example. Most parsers end a tag when at the first '>', even
> when it occurs inside a quoted attribute value. This case is
> explicitly mentioned in the HTML3 spec and users are suggested to use
> &gt; to escape '>' in tags. It appears that the spec forces you to
> allow both. Which means that my parser will not be SGML compliant,
> because the HTML spec was not enforced strongly enough.
Hmmm... what it (the March, 1995 HTML 2.0 draft) says is:
Note: Some historical implementations consider any occurrence
of the ">" character to signal the end of a tag. For
compatibility with such implementations, when ">" appears in an
attribute value, it should be represented with a numeric
character reference, such as in: <IMG SRC="eq1.jpg"
alt="a &#62; b">
Note the "should." Not "must."
So it suggests that, rather than "a > b", you use another
representation: "a &#62; b". The document is still
conforming. Conforming parsers and systems will work. Where is the
conflict?
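Here is a small Python sketch of why the escaped form is safe for both kinds of parser. The two function names are hypothetical, and neither is a full tag parser. One scanner stops at the first ">", like the historical implementations. The other tracks quoted attribute literals.

```python
def naive_tag_end(text, start=0):
    """Index of the first '>' at or after start, ignoring quotes entirely."""
    return text.find(">", start)


def quote_aware_tag_end(text, start=0):
    """Index of the first '>' that is not inside a quoted attribute literal.

    Returns -1 if no such '>' is found.
    """
    quote = None  # the quote character that opened the current literal
    for i in range(start, len(text)):
        ch = text[i]
        if quote:
            if ch == quote:
                quote = None
        elif ch in "\"'":
            quote = ch
        elif ch == ">":
            return i
    return -1


if __name__ == "__main__":
    raw = '<IMG SRC="eq1.jpg" alt="a > b">'
    escaped = '<IMG SRC="eq1.jpg" alt="a &#62; b">'
    for tag in (raw, escaped):
        print(tag, naive_tag_end(tag), quote_aware_tag_end(tag))
```

On the raw tag the two scanners disagree about where the tag ends. On the escaped tag they agree, so the document works with both the historical and the conforming parser.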
> I would like to see a HTML specification that is a strict subset of
> SGML.
So would we all. That's what we have in HTML 2.0, and what we will
have in HTML 3.0.
> It would mean that
> there are a lot of invalid documents out there, but these will have to be
> updated eventually. In the end there should be one HTML standard, and
> not just a bunch of interpretations of the standard.
I believe we are all in agreement here.
If you can find places in the HTML specs that you believe conflict
with the SGML standard, please continue to call them out.
Dan