You would think so, huh?
This is another example of "SGML is like quantum mechanics: to learn
it, you have to put your intuition on hold and just believe."
The HTML spec, in section 3.4, "Working with Structured Text", has
a note:
    NOTE: The SGML declaration for HTML specifies SHORTTAG YES,
    which means that there are other valid syntaxes for tags, such
    as NET tags, <EM/.../; empty start tags, <>; and empty end
    tags, </>. Until support for these idioms is widely deployed,
    their use is strongly discouraged.
The following productions from ISO 8879:1986, the SGML standard, show
that the above is legal:
    [15] minimized start-tag = empty start-tag | unclosed start-tag |
                               net-enabling start-tag

    [17] unclosed start-tag = stago, document type specification,
                              generic identifier specification,
                              attribute specification list, s*

        A start-tag can be an unclosed start-tag only if it is
        followed immediately by the character string assigned to
        the stago or etago delimiter role, regardless of whether the
        string begins a valid delimiter-in-context sequence.
In plain English, that means you can write things like:
<h1 <em>some text</em> </h1>
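To make the consequence concrete, here is a rough Python sketch of a
tokenizer that accepts a start-tag closed either by ">" or implicitly by
the next "<". It is only an illustration, not how sgmls or any real SGML
parser works: the names tokenize, START_TAG and END_TAG are made up for
this example, it ignores the document type specification, and it assumes
attribute values never contain "<" or ">".

```python
import re

# Hypothetical patterns for this sketch only.
# Assumption: attribute values never contain '<' or '>'.
START_TAG = re.compile(r"<([A-Za-z][A-Za-z0-9.-]*)[^<>]*")
END_TAG = re.compile(r"</([A-Za-z][A-Za-z0-9.-]*)\s*>")


def tokenize(text):
    """Return a list of ('start', name), ('end', name), ('text', data)."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = END_TAG.match(text, pos)
        if m:
            tokens.append(("end", m.group(1).lower()))
            pos = m.end()
            continue
        m = START_TAG.match(text, pos)
        if m:
            end = m.end()
            if end == len(text):
                # Input ran out inside the tag; keep the rest as text.
                tokens.append(("text", text[pos:]))
                break
            if text[end] == ">":
                end += 1  # ordinary start-tag: consume the closing '>'
            # Otherwise text[end] is '<': an unclosed start-tag, which
            # ends where the next tag begins, so nothing more is consumed.
            tokens.append(("start", m.group(1).lower()))
            pos = end
            continue
        # Plain character data up to the next '<' (or end of input).
        nxt = text.find("<", pos + 1)
        if nxt == -1:
            nxt = len(text)
        tokens.append(("text", text[pos:nxt]))
        pos = nxt
    return tokens


unclosed = tokenize("<h1 <em>some text</em> </h1>")
closed = tokenize("<h1><em>some text</em> </h1>")
```

Under those assumptions, both inputs produce the same token list, which
is the point: the unclosed form is just another spelling of the start-tag.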
>Are comments preprocessed out before parsing the rest?
No.
>Is there a LEX/YACC grammar for HTML somewhere?
No. I've tried to come up with something useful a few times, but due
to strangeness like the above, I always get frustrated and give up.
We should establish some "tractable" subset of SGML that implementors
can be expected to deal with reliably. The TEI folks have done something
like that. You might want to check out their stuff:
http://etext.virginia.edu/TEI.html
It seems to be inaccessible due to some glitch right now, so I can't
point you to it, but they've developed a yacc-compatible set of
productions that are a simplification of the productions in the SGML
standard.
>What is the ultimate authority for these kinds of lexical/parsing questions?
>Is it the SGML spec + the HTML DTD?
Exactly.
>Any help would be greatly appreciated.
I hope the above helps. You're on the right track by using the
validation service. It implements ISO 8879:1986 as faithfully as
anything I have ever had access to. (There are some commercial
products that do it better, but for HTML, sgmls should be good
enough.)
Dan
p.s. I hope you don't mind that I copied html-wg on this.