SGML is an _ugly_ solution to an _elegant_ problem [was: Parsing < and &lt;]

Dan Connolly (connolly@w3.org)
Mon, 24 Apr 95 20:50:35 EDT

Jon Bosak writes:
> > The more I learn about SGML, the less it seems an elegant solution and
> > the more it seems a mass of twisty little gotchas (like the `mixed content'
> > gotcha, for example). Is this just a phase in the learning curve, or is
> > SGML really like that?
>
> SGML is really like that. But it can't be any other way, because the
> problem it solves is not an elegant problem and does not admit of an
> elegant solution.
> This is very hard for computer folks to accept, but
> relatively easy for those of us who have had some exposure to either
> linguistics or the philosophy of language. If people thought the way
> that computers did, the problem would be much simpler, and so would
> SGML. But they don't, so it's not. That's the bottom line of
> approximately 25 years of collective inquiry into this subject.

Get real. Er, ahem. I disagree. :-)

SGML solves almost exactly the same problem as lex and yacc. Those
tools are ugly in their own way, but the problem is very much an
elegant problem: parsing context-free languages composed of regular
tokens.

The problem was crystallized in Chomsky's original work on languages in
the '60s. Not computer languages -- this guy came from the
humanities: he was studying the commonalities between human languages.
He drew the distinction between regular, context-free, and
context-sensitive languages. I don't recall if he made the distinction
about Turing-decidable languages -- something tells me that was Turing
or Church or one of those guys.

I don't have a citation handy, but you can read the fruits of his
labors in the dragon book ("Compilers: Principles, Techniques, and
Tools" by Aho, Sethi, and Ullman) or a book on automata theory (Lewis
and Papadimitriou, for one).

In fact, thanks to the web, I do have a citation handy:

Chomsky, N. (1963)
``Formal properties of grammars''. In R.D. Luce, R.R. Bush and
E. Galanter (eds.), Handbook of Mathematical Psychology, Volume 2.
New York: John Wiley.

Aho, A.V., Sethi, R. and Ullman, J.D. (1986)
Compilers: Principles, Techniques, and Tools. Reading, Mass.:
Addison-Wesley.

Heck... this whole bibliography I found seems to be dedicated to this
subject:

http://itkwww.kub.nl:2080/itk/Docs/Think/2-1/bib-srt.html

The C.S. Tech Report indexes are chock full of references to Chomsky's
work. This query alone yielded 41 citations:

http://harvest.cs.colorado.edu/Harvest/cgi-bin/BrokerQuery.pl.cgi?query=Chomsky&host=harvest.cs.colorado.edu%3A8505&caseflag=on&wordflag=on&errorflag=0&opaqueflag=on&descflag=on&csumflag=on&verbose=on&maxresultflag=500

The point is that (1) it is an elegant problem with tractable
solutions, and (2) this was well known at the time the SGML
specification was being developed (by everyone but the SGML designers,
it seems).

Had the SGML designers done their homework, the definition of a
conforming SGML document would be something like this:

"An SGML document is:
* a coded character set
* a set of terminals, each with an associated regular expression
over the above character set
(the SGML declaration, which should have about the
same expressive capability as a lex specification,
less the C code)
* an unambiguous context-free grammar over the above set
of terminals
(the prologue, or DTD, which should have about the
same expressive capability as a yacc grammar, with
syntactic sugar for inclusions and exclusions)
* a string over the coded character set
(the instance)
If the sequence of tokens determined by the instance
and the SGML declaration is in the language generated
by the DTD grammar, the document is a conforming
SGML document."

The whole SGML specification should boil down to that: something that
can be defined in about a page, in terms of the terminology of
computational linguistics. Add a few pages to explain the syntax of
the DTD and declaration, and you're done. (Better yet: use
s-expression syntax and save yourself the trouble.) Voila! A
conforming SGML parser is a one man-month project, rather than an >1
man-year project.

The differences between the above and the actual SGML specification
are just noise, which I attribute to a lack of clear thinking and
research by the designers of SGML. I have seen papers written at the
time of the development of SGML that pointed out these differences,
and I heard that these objections were "shouted down," not argued
technically. That thought galls me. I will not stand for it here.

OK, so SGML has some novelties, like inclusions, exclusions, and short
references. But inclusions and exclusions are just syntactic sugar for
representing exponentially large context-free grammars; they _don't_
give any expressive power beyond context-free grammars. They just save
a bunch of typing. Granted, minimizing typing -- representing common
idioms succinctly, that is -- is a large part of language design. But
that doesn't mean you have to make up a new, broken, baroque parsing
strategy while you're incorporating these idioms.
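
To show what I mean by sugar, here's a back-of-the-envelope sketch (in
Python, with an invented grammar and element names; this is not the
algorithm from the standard) of compiling a single exclusion away into
plain context-free productions:

    # An exclusion like <!ELEMENT quote - - (para) -(footnote)> says
    # "footnote may not appear anywhere inside quote".  A plain CFG
    # can say the same thing by duplication: every nonterminal gets a
    # twin in which productions mentioning the excluded element are
    # dropped.  Each exclusion can roughly double the rule count,
    # which is where the blowup -- and the appeal of the shorthand --
    # comes from.

    def compile_exclusion(grammar, excluded, mark="'"):
        def twin(s):
            return s + mark if s in grammar else s
        out = dict(grammar)
        for name, prods in grammar.items():
            out[name + mark] = [tuple(twin(s) for s in p)
                                for p in prods if excluded not in p]
        return out

    GRAMMAR = {
        "para":     [("DATA",), ("DATA", "footnote"), ("emph",)],
        "emph":     [("EM_START", "para", "EM_END")],
        "footnote": [("FN_START", "para", "FN_END")],
        "quote":    [("Q_START", "para", "Q_END")],
    }

    expanded = compile_exclusion(GRAMMAR, "footnote")
    expanded["quote"] = [("Q_START", "para'", "Q_END")]  # quote uses the twins

    print(expanded["para'"])    # [('DATA',), ("emph'",)]
    print(len(GRAMMAR), "rules before,", len(expanded), "after one exclusion")

Mechanical, tedious, and exactly the kind of thing you'd never ask a
human to write out by hand -- hence the sugar.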

Short references and entity references become part of the lexer. Perhaps
there are a few details to work out there, but nothing really hairy.
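
For entity references, for instance, the lexer-level treatment I have
in mind is roughly this (a toy sketch; the entity table and names are
invented, and a real lexer would also have to decide whether
replacement text gets rescanned for markup -- the sort of question
this thread started with):

    import re

    # Toy entity table; in real life this comes from the DTD.
    ENTITIES = {"amp": "&", "lt": "<", "w3c": "World Wide Web Consortium"}

    def expand_entities(text):
        """Treat general entity references as a lexical matter: splice
        the replacement text in before the token stream ever reaches
        the grammar."""
        def repl(match):
            name = match.group(1)
            if name not in ENTITIES:
                raise ValueError("undeclared entity: &%s;" % name)
            return ENTITIES[name]
        return re.sub(r"&([A-Za-z][A-Za-z0-9.-]*);", repl, text)

    print(expand_entities("the &w3c;: a &lt; b &amp; c"))
    # the World Wide Web Consortium: a < b & c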

Why the silly newline rules? "To get SGML applications to treat
newlines consistently," I've heard. Seems like a prime example of
overspecification to me. If the applications need to be consistent,
they'll be consistent. No need to gook up the spec to twist arms like
this.

Why the silly tag inference rules? Why not just use LALR(1) parsing
rules? Why the silly rules about ambiguous content models? Why not
just take the whole DTD and test it for ambiguity (or LALR(1)-ness if
you want to avoid backtracking) and be done with it?
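
(Ambiguity of an arbitrary context-free grammar is undecidable in
general, but that's beside the point: determinism checks like the
LALR(1) conflict detection yacc already performs are perfectly
mechanical.) Here's a deliberately crude sketch of the flavor I mean
-- a FIRST-set disjointness check, far weaker than real LALR(1) table
construction, over an invented grammar modeled on an ambiguous content
model like (title?, title):

    def first(symbol, grammar):
        """FIRST set of a symbol.  Assumes no empty productions and no
        left recursion, just to keep the sketch short."""
        if symbol not in grammar:          # terminal
            return {symbol}
        result = set()
        for prod in grammar[symbol]:
            result |= first(prod[0], grammar)
        return result

    def conflicts(grammar):
        """Nonterminals with two alternatives that can begin with the
        same terminal -- the moral equivalent of a parser generator
        reporting a conflict, only much cruder than LALR(1)."""
        bad = []
        for name, prods in grammar.items():
            starts = [first(p[0], grammar) for p in prods]
            if any(starts[i] & starts[j]
                   for i in range(len(starts))
                   for j in range(i + 1, len(starts))):
                bad.append(name)
        return bad

    GRAMMAR = {
        "doc":  [("head", "BODY")],
        "head": [("TITLE", "TITLE"), ("TITLE",)],  # like (title?, title)
    }

    print(conflicts(GRAMMAR))   # ['head'] -- reject the DTD, mechanically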

> But if you understand what SGML attempts to
> do (which is fairly awesome) and you understand what writers do (which
> is fairly awful, in both the original and current senses of the word),
> then you will understand that radically simplifying SGML is a task as
> huge and as ultimately doomed as, say, radically simplifying the
> English language. You can make the computer happy, but then
> Shakespeare doesn't fit in there any more.

After 4+ years studying the formalisms involved, and 4+ years applying
the theory in the documentation tools industry (dealing with writers
and the tools they use daily), I cannot disagree strongly enough.

If you _really_ understand what SGML is trying to do, you realize it's
just lex and yacc with a couple twists.

All writers want is a nice point-and-click interface that behaves
reliably, if not intuitively. They'll hack around whatever tools you
try to confine them with -- that's their nature, I agree. But making
SGML baroque only makes everybody's job harder.

I agree that fixing SGML now might be intractable: but that's due to
the legacy of existing documents and the fact that we can't check
to be sure there aren't any documents that would become illegal
if SGML were cleaned up, not because the original problem SGML set
out to solve is as hairy as SGML makes it out to be.

In fact, I've heard some talk about specifying a tractable variant of
SGML in the upcoming revision -- using a tweak in the SGML declaration
to say: "this document uses only Chomsky-happy features." If the debate
about this stuff is available on the internet, I'll take part.

But I'd really rather get on to distributed objects, virtual reality,
and other more interesting stuff. These linguistics problems were
solved in the '60s and '70s, for chrissakes!

Daniel W. Connolly "We believe in the interconnectedness of all things"
Research Technical Staff, MIT/W3C
<connolly@w3.org> http://www.w3.org/hypertext/WWW/People/Connolly