A thought on implementation...

"Daniel W. Connolly" <connolly@hal.com>
Errors-To: listmaster@www0.cern.ch
Date: Thu, 17 Feb 1994 00:58:58 +0100
Message-id: <9402162355.AA06475@ulua.hal.com>
Reply-To: connolly@hal.com
Originator: www-talk@info.cern.ch
Sender: www-talk@www0.cern.ch
Precedence: bulk
From: "Daniel W. Connolly" <connolly@hal.com>
To: Multiple recipients of list <www-talk@www0.cern.ch>
Subject: A thought on implementation...
X-Listprocessor-Version: 6.0c -- ListProcessor by Anastasios Kotsikonas
Content-Length: 2910

It occurs to me that it is unjustifiably difficult to do things to
HTML documents like:

	* list all the URLs in a node
	* list all the H1, H2, and H3 headings in a node
	* find the title of a node

correctly, because of bleed between the regular, context-free, and
context-sensitive idioms of SGML. For example:

	<XMP>this: <A HREF="abc"> looks like a link, but it's
	not, because <XMP> is an RCDATA element, and STAGO is not
	recognized in RCDATA</XMP>

	<!-- this: <A HREF="abc"> looks like a link too! -->

	And this: a < b > c has no markup at all, even though it
	uses the "magic" < and > chars.

	<A HREF='<A HREF="weird, but possible">'>I bet this would
	break most contemporary implementations!</a>

Suppose we decide to standardize on two things:

	(1) a DTD in the strictest sense of SGML compliance
	(or better yet, a set of architectural forms...)
	that defines HTML in a somewhat abstract sense in terms
	of elements and character data (and entities?); a rough
	sketch follows just after this list

	(2) a context-free interchange language which is a subset
	of the SGML syntax.
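
For (1), I mean declarations along these lines (just a sketch with
made-up content models, not the real HTML DTD):

	<!-- sketch only; invented content models -->
	<!ELEMENT HTML - - (HEAD, BODY)>
	<!ELEMENT A    - - (#PCDATA)>
	<!ATTLIST A
		HREF  CDATA  #IMPLIED
		NAME  CDATA  #IMPLIED>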

You could use the DTD if you have real SGML tools and you want to
use minimization, comments, and < chars as data.
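
For instance, with real SGML tools and the DTD, you could write

	<UL><LI>one<LI>two</UL>

and let the parser infer the two omitted </LI> end tags.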

But for interchange within the WWW application, we'd agree that, for
example, the < character is _always_ markup, and we'd use &#60; for
the data character '<'.
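
That is, where today you might get away with

	a < b

as character data, the interchange form would always be

	a &#60; b

so a parser could treat every literal < as the start of markup.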

Here are (at least some of) the rules we'd adopt over and above SGML:

* No <!-- --> comments
* No <![foo] .. ]]> marked sections
* Always use numeric character references for '<', '>', and '&'
	(no harm in using &lt;, &gt;, &amp; forms, I suppose)
* Use numeric character references for ", \n, \t inside attribute
value literals
* Always quote attribute value literals with double quotes, not single
quotes.
* Don't split attribute values across lines (Hmmm...)
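
Under these rules, a conforming fragment would look something like
this (a made-up example):

	<A HREF="doc.html?x=1&#38;y=2" TITLE="a &#34;quoted&#34; title">
	less-than is &#60;, greater-than is &#62;, ampersand is &#38;</A>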

Then the "search for HREF's" problem could be coded in ~20 lines of perl:

	while(<>){	# read a line
		while(/<[^>]*$/){	# line looks like ...<TAG
					# with no >... read another line
			last if eof();
			$_ .= <>;
		}
		while(s/^[^<]*<([^>]*)>//){ # strip up to and including the next tag
			next unless $1 =~ /^(\w+)\s*(.*)$/s; # start tags only
			local($gi, $attrs) = ($1, $2);
			$gi =~ tr/a-z/A-Z/; # convert to upper-case
			if($gi eq 'A'){
				# for each attr...
				while($attrs =~ s/^(\w+)\s*=\s*"([^"]*)"\s*//){
					local($name, $val) = ($1, $2);
					print "HREF: $val\n" if $name eq 'HREF';
				}
			}
		}
	}
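
Save that as, say, find-hrefs.pl (the name is arbitrary) and run it
over a document containing the <A HREF="abc"> example above:

	$ perl find-hrefs.pl doc.html
	HREF: abc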

I can almost guarantee that those 20 lines of perl are already in use
as a heuristic solution to that problem now (I looked at the elisp
code for the Emacs w3 client, and believe me: it's all over the place).

It's clear to me that folks are going to write HTML parsers based on
intuition and experience with context-free languages. Learning exactly
how SGML parsing works is sufficiently difficult that folks won't do it.

Instead of declaring that perl code to be busted, why don't we agree
that the SGML folks didn't know much about automata theory and tighten
up the definition of HTML a little?

Dan