Re: SGML, HTML and CS:

Daniel W. Connolly (connolly@hal.com)
Mon, 12 Sep 94 00:48:41 EDT

In message <9409112039.AA26717@arctic.crl.dec.com>, raman@crl.dec.com writes:
>Most of the readers of this list probably read Erik Naggum's excellent article
>on comp.text.sgml about the need to make sgml more compliant with CS
>terminology and tools.
>
>Any opinions?

Mixed. I had many of the same things to say. Check the www-talk and
www-html archives for the "A Thought on Implementation..." thread.

>Far from it, it's been a struggle. What's worse, most of the sgml tools seem
>to be totally incomprehensible. Every DTD or specification document I read is
>littered liberally with iso standard numbers, (which make no sense to me ) and
>though I know I should not complain about surface syntax, I find the syntactic
>presentation of DTD's extremely difficult to absorb.

Welcome to the club :-{

>Considering that an HTML document represents a fairly simple hierarchical
>structure, why not start describing it as such?

Be my guest... you'll find that when you get down to it, given the
current state of affairs, a couple hundred line SGML DTD is about
the most precise, succinct description of HTML you'll find. I can
tell you this from over 2 years of experience. Though even just
yesterday I tried (somewhat successfully, I might add...) to capture
SGML syntax in a lex description. I'll attach it below just for
grins.
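
To give a flavor of what a lexical description pins down, here is a
minimal C sketch of just one token class, the SGML name, assuming the
reference concrete syntax (a letter followed by letters, digits, '.'
or '-'). The function name is_sgml_name is made up for this note; it
is not part of the attachment.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if s[0..len) is an SGML name under the
 * assumption stated above (a letter, then letters, digits, '.' or '-'),
 * and 0 otherwise. */
int is_sgml_name(const char *s, size_t len)
{
    size_t i;

    /* a name must start with a letter */
    if (len == 0 || !isalpha((unsigned char)s[0]))
        return 0;

    /* every following character must be a name character */
    for (i = 1; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (!isalnum(c) && c != '.' && c != '-')
            return 0;
    }
    return 1;
}
```

So "IMG" and "http-equiv" are names, while "1abc" and "a b" are not.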

>This would make the task of writing parsers easier, and also
>encourage good HTML.
>Currently, the definition of valid HTML is so inaccessible even to the
>practicing computer scientist, leave alone the author of a document, that the
>only validation being used is "Does Mosaic display this document?"
>according to some subjective measure of "correct display".

Amen, brother. You're preaching to the choir. Now: break out your time
machine, go back a few years and talk TimBL out of basing HTML on SGML
(or maybe it was me that really made the connection between HTML and
SGML -- but it was Tim's idea). Better yet, go back 10 or 15 years
and teach the SGML committee about compiler technology and automated
parsing.

>At present, I have a hard time understanding for example what kinds of nesting
>are allowed by a particular DTD, when reading the HTML spec, I just resorted
>to the descriptive statements for each of the elements.
>
>I spent considerable time installing/understanding SGMLS a couple of months
>ago, and after fighting hard even managed to find a dtd for html on the net as
>well as the other files necessary to make sgmls parse simple html
>documents. But the whole process of running SGMLS is so obfuscated, i can't
>remember all the things I needed to do now, and after retrieving the latest
>DTD just gave up on trying to validate my documents using it and SGMLS as a
>waste of time.
>This state of affairs is frightening!

Yes, this state of affairs is frightening, and we have a lot of work
to do to properly deploy the HTML specification to the point where
it is accessible to software engineers and authors.
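
As a rough illustration of how far "does Mosaic display it?" falls
short of validation, here is a sketch of the most basic structural
check: every end tag must close the most recently opened element. It
assumes the tags have already been extracted, with end tags spelled
with a leading '/'; check_nesting and MAXDEPTH are hypothetical names.
A real validator would also have to know the DTD's content models and
omitted-tag rules, which this ignores.

```c
#include <string.h>

enum { MAXDEPTH = 64 };  /* arbitrary limit, chosen for the sketch */

/* Hypothetical checker. Each entry of tags[] is an element name for a
 * start tag, or "/" followed by the name for an end tag. Names are
 * compared exactly, which is a simplification.
 * Returns 1 if every end tag closes the most recently opened element
 * and nothing is left open; 0 otherwise. */
int check_nesting(const char *const *tags, int n)
{
    const char *stack[MAXDEPTH];
    int depth = 0;
    int i;

    for (i = 0; i < n; i++) {
        const char *t = tags[i];

        if (t[0] == '/') {
            /* end tag: must match the innermost open element */
            if (depth == 0 || strcmp(stack[depth - 1], t + 1) != 0)
                return 0;
            depth--;
        } else {
            /* start tag: push it, refusing to overflow the stack */
            if (depth == MAXDEPTH)
                return 0;
            stack[depth++] = t;
        }
    }
    return depth == 0;
}
```

The sequence UL, LI, /LI, /UL passes; B, I, /B, /I does not.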

But I must protest against any form of SGMLS bashing. SGML is a strange
beast -- horribly contorted and impenetrably specified. But I learned
90% of what I know using SGMLS and the "try it and see" method. (Another
5% was gleaned from patiently reading Erik Naggum's posts and sifting
through the noise for the good stuff. The remaining 5% I actually
got by reading ISO 8879 itself. I recommend the reading of that document
as penance only for the most heinous of sins.)

James Clark has done an immeasurable service to the net.community
by making SGMLS the high-quality piece of software that it is, and
by providing it to the development community free of charge.

>I may be wrong, but I somehow get the impression that the whole story
>regarding sgml/HTML has been made more complicated/obfuscated than it needs to
>be.

Oh, my brother: would that you were wrong! After spending about two
weeks reading the SGML standard, one realizes that SGML provides few
features above and beyond lex/yacc. It is disheartening to realize that
a technology that should represent one man-month to implement actually
requires more like a man-year or two. There should have been a libSGML
years ago that would, by now, be in /usr/lib on every machine on
the planet.
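
For the sake of argument, the kind of interface I have in mind for
such a library might look like the sketch below: a pull-style scanner
that hands back one token per call. Every name here (sgml_next,
sgml_token, the SGML_* kinds) is hypothetical, and the sketch handles
only attribute-free tags and character data, so it shows the shape of
an API rather than anything like SGML itself.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds and record; not taken from any real library. */
enum sgml_kind { SGML_EOF = 0, SGML_DATA, SGML_STAG, SGML_ETAG, SGML_ERROR };

typedef struct {
    enum sgml_kind kind;
    const char *text;   /* points into the caller's buffer, not NUL-terminated */
    size_t len;
} sgml_token;

/* Pull the next token from *pp and advance *pp past it.
 * Only attribute-free tags are recognized; anything else after
 * "<name" is reported as SGML_ERROR and *pp is left unchanged. */
enum sgml_kind sgml_next(const char **pp, sgml_token *tok)
{
    const char *p = *pp;
    const char *q;

    tok->text = p;
    tok->len = 0;

    if (*p == '\0')
        return tok->kind = SGML_EOF;

    if (p[0] == '<' && (isalpha((unsigned char)p[1]) ||
                        (p[1] == '/' && isalpha((unsigned char)p[2])))) {
        int is_end = (p[1] == '/');

        /* the token text is the element name, without the delimiters */
        tok->text = q = p + (is_end ? 2 : 1);
        while (isalnum((unsigned char)*q) || *q == '.' || *q == '-')
            q++;
        tok->len = (size_t)(q - tok->text);
        if (*q != '>')
            return tok->kind = SGML_ERROR;
        *pp = q + 1;
        return tok->kind = is_end ? SGML_ETAG : SGML_STAG;
    }

    /* Character data: everything up to the next '<'. A '<' that does
     * not start a tag is treated as data. */
    q = p + 1;
    while (*q != '\0' && *q != '<')
        q++;
    tok->len = (size_t)(q - p);
    *pp = q;
    return tok->kind = SGML_DATA;
}
```

A caller would loop on sgml_next() until it returns SGML_EOF or
SGML_ERROR, much as one loops on yylex().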

Don't even get me started... :-)

>I know these are radical statements, but something needs to be done to make
>sgml/HTML validation and processing more palatable, or we'll have to spend the
>rest of our careers retrofitting our documents to kluges like Mosaic.

I'm afraid the only way out at this point is lots of good documentation
and support. The real damage is done. Arguments to the contrary
are more than welcome.

Dan

There are hardly any comments in this little ditty, so you may
find it yet another piece of impenetrable documentation
on SGML...

#!/bin/sh
# shar: Shell Archiver (v1.22)
#
# Run the following text with /bin/sh to create:
# sgml.l
# SGML.h
#
sed 's/^X//' << 'SHAR_EOF' > sgml.l &&
X
X%{
X
X#include <ctype.h>
X#include <stdlib.h> /* atoi, abort */
X#include <string.h>
X#include "SGML.h"
X
Xenum { MAXBUF=4096 }; /* enough for all the names and values in any tag */
X
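X/* Append l bytes of s to b at offset i, truncating so that one byte
X   always stays free for a terminating NUL; i is updated in place. */
X#define BUFCAT(b, i, s, l) \
X (memcpy((b)+(i), (s), (i)+(l) < MAXBUF ? (l) : MAXBUF-1-(i)), \
X  (i)+(l) >= MAXBUF ? ((i) = MAXBUF-1) : ((i) += (l)))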
X
Xstatic char *storename(char *buf, int *cur, int max,
X char *name, int namelen);
X
Xstatic const char* lookup_entity(const char *name);
X
XHTMarkup mu;
X%}
X
XS [ \t\f\r\n]
XLETTER [A-Za-z]
XDIGIT [0-9]
XNMCHAR [A-Za-z0-9.-]
XNAME {LETTER}{NMCHAR}*
XNUMBER {DIGIT}+
XTOKEN ({LETTER}|{DIGIT}){NMCHAR}*
X
X%x tag
X%x tagattr
X%x etag
X%x lit
X%x lita
X%x com
X%x com2
X
X%%
X static char buf[MAXBUF]; /* @# not reentrant! */
X int bp = 0;
X int aqty;
X int apending;
X
X BEGIN(0);
X
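X"<!--" { BEGIN(com); }
X<com>"--" { BEGIN(com2); }
X<com2>"--" { BEGIN(com); }
X<com>.|\n { /* comment text; '.' alone would not match a newline */ }
X<com2>{S}+ { /* white space between comments in one declaration */ }
X<com2>">" { BEGIN(0); }
X<com2>. { /*@@ error hmm... */ }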
X
X"<!"{NAME}[^>]+">" { /* @@ markup decl... not processed right! */ }
X
X"<?"[^>]+">" { /* @# report PI to app? */ }
X
X
X"<"{NAME} {
X if(bp>0){
X  /* flush pending character data first; the tag is rescanned */
X  yyless(0);
X  buf[bp++] = 0;
X  mu.type = Data;
X  mu.u.data = buf;
X  return Data;
X }
X
X mu.type = StartTag;
X
X mu.u.startTag.gi = storename(buf, &bp, MAXBUF,
X yytext+1, yyleng-1);
X
X aqty = apending = 0;
X BEGIN(tag);
X }
X
X<tag,tagattr>{S} {}
X
X<tag,tagattr>{NAME} {
X if(apending){
X /* @# IMG ISMAP hack */
X mu.u.startTag.attrval[aqty] =
X mu.u.startTag.attrname[aqty];
X aqty++;
X }
X
X if(aqty < ATTCNT){
X if(mu.u.startTag.attrname[aqty] =
X storename(buf, &bp, MAXBUF, yytext, yyleng))
X apending = 1;
X else apending = 0;
X } /* @# else fail silently? */
X
X BEGIN(tagattr);
X }
X
X<tagattr>"="{S}*{TOKEN} {
X char *pp = yytext+1;
X int l = yyleng-1;
X while(isspace(*pp)) { pp++; l--; }
X
X if(aqty < ATTCNT){
X if(mu.u.startTag.attrval[aqty] =
X storename(buf, &bp, MAXBUF, pp, l))
X aqty++;
X
X } /* @# else fail silently? */
X apending = 0;
X BEGIN(tag);
X }
X
X<tagattr>"="{S}*["] {
X if(aqty < ATTCNT){
X mu.u.startTag.attrval[aqty] = buf+bp;
X }
X BEGIN(lit);
X }
X
X<tagattr>"="{S}*['] {
X if(aqty < ATTCNT){
X mu.u.startTag.attrval[aqty] = buf+bp;
X }
X BEGIN(lita);
X }
X
X<lit,lita>"&#"{NUMBER}";"? {
X if(aqty<ATTCNT){
X char c = atoi(yytext+2);
X BUFCAT(buf, bp, &c, 1);
X }
X }
X
X<lit,lita>"&"{NAME}";"? {
X /*@@ entity ref in attr val! */
X }
X
X<lit>\" |
X<lita>\' {
X if(aqty < ATTCNT){
X buf[bp++] = 0;
X aqty++;
X } /* @# else fail silently? */
X apending = 0;
X BEGIN(tag);
X }
X
X<lit,lita>"&" |
X<lit>[^"&]+ |
X<lita>[^'&]+ {
X if(aqty < ATTCNT) BUFCAT(buf, bp, yytext, yyleng);
X }
X
X<tag,tagattr>">" {
X if(apending){
X /* @# IMG ISMAP hack */
X mu.u.startTag.attrval[aqty] =
X mu.u.startTag.attrname[aqty];
X aqty++;
X }
X
X mu.u.startTag.attrQty = aqty;
X return StartTag;
X }
X
X"</"{NAME} {
X if(bp>0){
X  /* flush pending character data first; the tag is rescanned */
X  yyless(0);
X  buf[bp++] = 0;
X  mu.type = Data;
X  mu.u.data = buf;
X  return Data;
X }
X
X mu.type = EndTag;
X mu.u.endTag =
X storename(buf, &bp, MAXBUF, yytext+2, yyleng-2);
X BEGIN(etag);
X }
X
X<etag>{S}+ {}
X
X<etag>">" { return EndTag; }
X
X<tag,etag>. {
X printf("error: `%s'\n", yytext);
X return -1; /* @# what to do? */
X /* @@ hmmm... unquoted literals? */
X }
X
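X&{NAME}";"? {
X char *name = yytext + 1;
X int l = yyleng - 1; /* length of the name, without the '&' */
X const char *cdata;
X
X /* drop the optional ';' and NUL-terminate the name in place */
X if(name[l-1] == ';') { l--; name[l] = 0; }
X
X if((cdata = lookup_entity(name)) != 0){
X  BUFCAT(buf, bp, cdata, strlen(cdata));
X }else{
X  /* @@ any data already pending in buf is dropped here */
X  mu.type = EntityRef;
X  mu.u.entityRef =
X   storename(buf, &bp, MAXBUF, name, l);
X  return EntityRef;
X }
X }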
X
X"&#"{NUMBER}";"? { char charnum = atoi(yytext+2);
X BUFCAT(buf, bp, &charnum, 1);
X }
X
X[^<&]+ |
X. {
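X if(bp + yyleng >= MAXBUF){
X  /* buffer full: keep what fits, push the rest back, return the data */
X  int fit = MAXBUF - 1 - bp;
X  yyless(fit);
X  BUFCAT(buf, bp, yytext, fit);
X  buf[bp++] = 0;
X  mu.type = Data;
X  mu.u.data = buf;
X  return Data;
X }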
X
X BUFCAT(buf, bp, yytext, yyleng);
X }
X
X%%
X
Xstatic char *storename(char *buf, int *curP, int max,
X char *name, int namelen)
X{
X int cur = *curP;
X char *ret = 0;
X
X if(cur + namelen < max){
X ret = buf + cur;
X
X memcpy(buf+cur, name, namelen);
X cur += namelen;
X buf[cur++] = 0;
X
X *curP = cur;
X }
X
X return ret;
X}
X
X
Xint yywrap()
X{
X return 1;
X}
X
X#define TEST 1
X
X#ifdef TEST
X
Xstatic const char *
Xlookup_entity(const char *name)
X{
X /* HTML specific @@ */
X if(strcmp(name, "amp") == 0) return "&";
X else if(strcmp(name, "lt") == 0) return "<";
X else if(strcmp(name, "gt") == 0) return ">";
X else if(strcmp(name, "quot") == 0) return "\"";
X else return 0;
X}
X
X
Xint main(int argc, char **argv)
X{
X int tok;
X int lvl = 0;
X
X while(tok = yylex()){
X switch(tok){
X case StartTag:
X {
X int i;
X
X putchar('\n');
X for(i = 0; i < lvl; i++) putchar('=');
X
X printf("<%s", mu.u.startTag.gi);
X for(i = 0; i < mu.u.startTag.attrQty; i++){
X printf(" %s=\"%s\"",
X mu.u.startTag.attrname[i],
X mu.u.startTag.attrval[i]);
X }
X printf(">\n");
X lvl++;
X }
X break;
X
X case EndTag:
X {
X int i;
X
X lvl--;
X putchar('\n');
X for(i = 0; i < lvl; i++) putchar('=');
X
X printf("</%s>\n", mu.u.endTag);
X }
X break;
X
X case Data:
X printf("%s", mu.u.data);
X break;
X
X case EntityRef:
X printf("entity: %s\n", mu.u.entityRef);
X break;
X
X default:
X printf("what the heck is %d?\n", tok);
X abort();
X }
X }
X
X return 0;
X}
X
X#endif
X
SHAR_EOF
chmod 0644 sgml.l || echo "restore of sgml.l fails"
sed 's/^X//' << 'SHAR_EOF' > SGML.h &&
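X#ifndef SGML_H
X#define SGML_H
X
X/* Capacity limits; ATTCNT bounds the attribute arrays below. */
Xenum {
X NAMELEN = 72,
X LITLEN = 1024,
X ATTCNT = 40
X};
X
Xtypedef struct _HTMarkup HTMarkup;
X
X/* One unit of markup or data, as reported by the scanner. */
Xstruct _HTMarkup{
X enum { Data=1, StartTag, EndTag, EntityRef } type;
X union{
X  char *data; /* character data */
X
X  struct{
X   char *gi; /* generic identifier (element name) */
X   int attrQty; /* number of attributes that follow */
X   char *attrname[ATTCNT]; /* parallel arrays: names... */
X   char *attrval[ATTCNT]; /* ...and their values */
X  }startTag;
X
X  char *endTag; /* element name in the end tag */
X
X  char *entityRef; /* name of an entity the scanner did not expand */
X }u;
X};
X
X
X#endif /* SGML_H */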
SHAR_EOF
chmod 0644 SGML.h || echo "restore of SGML.h fails"
exit 0