Re: Re Dan on implementation

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Thu, 17 Feb 1994 02:50:55 --100
Message-id: <9402170147.AA06698@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: Re: Re Dan on implementation 
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 8780
In message <199402170040.AA13406@rock.west.ora.com>, Terry Allen writes:
>
>BTW, would explain "autonoma theory" and how it relates
>to such mundane things as the syntax for comments?  
>

So... you're one of these folks that drive the SGML bandwagon without
any background in formal systems, huh? ;-)

Ok... I'll try to summarize a 3 semester hour computer science class
in a few paragraphs:

About 50 years ago, some linguists figured out that there are a
handful of useful classes of languages:

	"Regular" languages: those that can be recognized with
	a finite state machine (hence the term "regular expression",
	which is so pervasive among the unix priesthood)

	e.g. "any string consisting of all A's"

	The classical example of a language which is _not_ regular
	is "any string with ()'s balanced" -- it takes an arbitrary
	amount of state (i.e. not finite) to remember how many ('s
	are still pending.

	"Context Free" languages: those that can be recognized
	with a pushdown automaton (a la yacc)

	Examples of context free languages:
	"any string with ()'s balanced"
	"any string of the form xyz.zyx, e.g. fred.derf, wilma.amliw"
	C (mostly -- "undeclared identifiers" and such are context-sens.)
	lisp s-expressions

	Examples of languages that are _not_ context free:
	"any string of the form xyz.xyz, e.g. fred.fred, wilma.wilma"

	"Context Sensitive" languages: nobody cares about these. They
	just lump them in with "General" languages.

	"General" aka "Turing Decidable" languages: languages
	that can be recognized by a Turing machine (i.e. a C program).

	"Turing Undecidable" languages: languages that a computer
	can't recognize.

	Examples: "a C program that has no infinite loops"
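
To make the first two classes concrete, here is a minimal Python sketch.
The function names (all_as, balanced, mirrored) are made up for
illustration, and Python's re module is only standing in for a finite
state machine here:

```python
import re

def all_as(s: str) -> bool:
    # Regular: a fixed pattern (a finite state machine) is enough.
    return re.fullmatch(r"A*", s) is not None

def balanced(s: str) -> bool:
    # Not regular: the count of pending ('s can grow without bound.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:          # a ")" with no matching "("
                return False
    return depth == 0

def mirrored(s: str) -> bool:
    # Context free: "xyz.zyx" checked the way a pushdown automaton
    # would, by pushing the left half and popping against the right.
    left, sep, right = s.partition(".")
    if not sep:
        return False
    stack = list(left)
    for ch in right:
        if not stack or stack.pop() != ch:
            return False
    return not stack
```

The counter in balanced() and the stack in mirrored() are exactly the
"arbitrary amount of state" that a finite state machine doesn't have.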

If we keep HTML down to a context-free language composed of regular tokens,
then folks can write little 20-line ditties in perl, elisp, lex, yacc,
etc. and get real work done.
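
As a rough idea of what such a ditty might look like, here is a sketch
in Python rather than perl. It assumes the restriction proposed in this
thread (a "<" is always markup), and it simplifies heavily: attribute
text is kept raw, and a ">" inside a quoted attribute value would
confuse it. The names TOKEN and tokenize are hypothetical:

```python
import re

# Every token is described by a regular expression.
TOKEN = re.compile(r"""
    (?P<end><\/[A-Za-z][A-Za-z0-9]*\s*>)       # end tag, e.g. </P>
  | (?P<start><[A-Za-z][A-Za-z0-9]*[^<>]*>)    # start tag with raw attribute text
  | (?P<data>[^<]+)                            # character data
""", re.VERBOSE)

def tokenize(text):
    """Split text into (kind, lexeme) pairs; reject any stray '<'."""
    pos = 0
    out = []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"stray '<' at offset {pos}")
        out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out
```

Given '<P>a &#60; b</P>' this yields a start tag, a data token and an
end tag; given 'a < b' it raises an error instead of guessing.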

If we require real-time processing of all legal SGML documents,
we buy nothing in terms of functionality, and we render almost
all current implementations broken.


>| 	<XMP>this: <A HREF="abc"> looks like a link, but it's
>| 	not because <XMP> is an RCDATA element, and STAGO is not
>| 	recognized in RCDATA</XMP>
>
>Off the point, I'll bet Mosaic sees it as a link.

Exactly. So let me ask you: do you use Mosaic anyway? I thought so.
Do you really think it's worthwhile to implement a full SGML parser
in Mosaic just so you can use raw < instead of &lt; ?

>| 	<!-- this: <A HREF="abc"> looks like a link too! -->
>
>How so?  It's in a comment, and so will be ignored by a parser.

Yes, by an SGML-compliant parser, but not by any parser built
out of standard parsing tools like regular expressions, lex, and yacc.
(well, actually, you could do it with lex, but it's a pain...)
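
To illustrate, here is the sort of quick regular-expression link
extractor people actually write, sketched in Python. The names HREF,
naive_links and links_skipping_comments are hypothetical, and the
comment-stripping pattern is itself a simplification of real SGML
comment syntax:

```python
import re

HREF = re.compile(r'<A\s+HREF="([^"]*)"', re.IGNORECASE)

def naive_links(html):
    # Knows nothing about comments, so it reports anchors inside them.
    return HREF.findall(html)

def links_skipping_comments(html):
    # An extra pass that drops <!-- ... --> spans before matching.
    without = re.sub(r"<!--.*?-->", "", html, flags=re.DOTALL)
    return HREF.findall(without)

doc = '<!-- this: <A HREF="abc"> looks like a link too! -->\n<A HREF="real">x</A>'
```

On doc, naive_links() reports both "abc" and "real", while
links_skipping_comments() reports only "real".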

>| 	And this: a < b > c has no markup at all, even though it
>| 	uses the "magic" < and > chars.
>
>But not in the magic combinations <[A-Za-z] etc.

Right. The famous "delimiter in context". Contrast this with the
vast majority of "context free" languages in use.

>Your argument so far does not indicate a need for this.  You have
>simply remarked that some conventions of SGML are not what you'd
>like them to be.  They're not what I'd like them to be, either,
>but SGML is where we're at so far as document markup, today.

My argument is based on the fact that HTML documents must be
parsed _interactively_, i.e. at reasonably high speed, i.e. in
real time. The run-time cost of full SGML parsing far outweighs
the benefits.

>| You could use the DTD if you have real SGML tools and you want to
>| use minimization, comments, and < chars as data.
>| 
>| But for interchange within the WWW application, we'd agree that, for
>| example, the < character is _always_ markup, and we'd use &#60; for
>| the data character '<'.
>
>Dan, HTML is defined as an SGML DTD.  If that's to continue to be so,
>you can't apply these restrictions---unless you want to write a
>crippled SGML parser that complains about free-floating < > etc.  

You say "crippled", I say "expedient". Remember: the documents are
still conforming. It's just the WWW client parser that's non-standard.

>Furthermore, there is no reason at all to
>use &#60; for &lt;, and it is a weakness of the present DTD that 
>it doesn't use the standard ISO pub and num entity sets.

No, it's not. The ISO pub and num entity sets are designed for
situations where the <, #, etc. characters represent system-specific
data entities, e.g. characters from a special "math" font.

The HTML DTD uses ISO-latin1 as its character set, and '<' is a
perfectly normal ISO latin 1 character -- number 60 in the collating
sequence. A parser is required to treat the strings "&#60;" and "<"
identically (unless "<" is followed by a letter...). It was a mistake
on my part early on to confuse &lt; with &#60;.

But as it turns out, the ISO pub and num entity names make handy
mnemonics for ISO characters. So we use them. But according to
the DTD, &lt; is defined to mean exactly the string "<" -- not
"whatever your system would like to use for a less-than sign"
as in the ISO pub entity set.
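
A quick way to see the equivalence, using Python's standard html module
as a stand-in (it follows later HTML rules rather than this DTD, so
treat it as an illustration only; decode is a made-up name):

```python
import html

def decode(s: str) -> str:
    # Resolve both numeric (&#60;) and named (&lt;) references.
    return html.unescape(s)

# Character number 60 is '<', so both spellings give the same data.
same = decode("a &#60; b") == decode("a &lt; b") == "a < b"
```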

>
>| Here are (at least some of) the rules we'd adopt over and above SGML:
>| 
>| * No <!-- --> comments
>
>Over my dead body.  This is SGML.  Run it through a parser and you'll
>never have trouble with comments.

Hey, buddy, YOU run it through a parser :-). Seriously: if you want
to put these idioms in your source, then have your server remove them
before sending them to other WWW clients. Or batch-convert them. But
DON'T require every WWW client to parse it.

>  Lots of people want them, and
>it's a problem now that Mosaic, incorrectly, renders tagged text within 
>comments.


Ok... comments are useful. How about this: Comments must be of the form:

<!-- comment -->

and not

lksjdflkj <!-- comment in middle of line -->

nor

<!-- comment
split across lines -->
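
With that restriction, dropping comments becomes a line filter. A
minimal Python sketch, assuming a comment must occupy the whole line
with no leading or trailing whitespace (LINE_COMMENT and
strip_line_comments are hypothetical names):

```python
import re

# A whole line of the form <!-- ... --> with no "-->" inside the body.
LINE_COMMENT = re.compile(r"^<!--((?!-->).)*-->$")

def strip_line_comments(text: str) -> str:
    kept = [ln for ln in text.split("\n") if not LINE_COMMENT.match(ln)]
    return "\n".join(kept)
```

Lines with a comment in the middle, or with a comment split across
lines, are passed through untouched, since the proposed rule would not
allow them in the first place.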

>
>| * No <![foo] .. ]]> marked sections
>
>I don't care about this, but someone else may.  Why forbid them?

Because of the processing overhead! They're context-sensitive at
least, if not Turing-decidable. This means adding thousands of lines
of code. And maintaining it...

>| It's clear to me that folks are going to write HTML parsers based on
>| intuition and experience with context-free languages. Learning exactly
>| how SGML parsing works is sufficiently difficult that folks won't do it.
>
>Too bad for them.  They aren't following the spec, then.  Please don't
>tell us we can't follow the spec.  I fully understand that the Webbers
>who first decided to define HTML as SGML bit off far more than they
>knew about, but for those of us who want to get our documents online,
>the HTML DTD is where the rubber hits the road.  Browser writers
>have to learn to live with SGML, warts and all.

The one thing I've learned about the internet is that the party who
writes and distributes code to implement his spec is the guy who
sets the standard. I bet I can get client developers to agree on
my idea sooner than you can get them to adopt SGML.

We'd accomplish these objectives:

	(1) These restricted HTML documents are still compliant.
	They still work with SGML tools.
	(2) We could teach folks what HTML looks like a whole lot more easily.
	(3) We could write HTML processing software more easily.
	(4) We would increase confidence among authors that their
	documents will be rendered (and searched, indexed, outlined,
	and otherwise processed) accurately.

I bet most SGML-producing tools follow these restrictions anyway. It's
just a question of getting the SGML-producing people to do it.

>
>| In stead of declaring that perl code to be busted, why don't we agree
>| the SGML folks didn't know much about autonoma theory and tighten up
>| the definition of HTML a little?
>
>You mean define a new "Dan's SGML."

Look: I tried _real_hard_ to define HTML in terms of SGML and get the
community to back me up. It didn't work. So I'm willing to shed some
of SGML's obscure features in favor of reliability.

>  I don't think this is a reasonable
>solution.  But, Dan, you have the energy to write TGML (my pet name for
>a hypothetical successor to SGML) that would do these things right, and 
>some others, too.  When someone gets around to writing it, whether it
>is part of a standards process or no, TGML will replace SGML in short 
>order, especially if it has a readable manual and a free parser.

Ah... so I guess we agree on that part! Keep in mind that these TGML
documents are still SGML documents in every sense of the word. It is
only the TGML parser which is not an SGML parser.

So if you have some SGML documents which are not TGML documents, just
run them through sgmls (in batch) and I'll provide a 20-line perl script
to output a "super-normalized" version acceptable to a TGML parser.

Dan