suggested libWWW architecture

Dan Connolly <connolly@pixel.convex.com>

Mail folder: WWW Talk Jan-Mar 1993 Archives

Message-id: <9301140111.AA21994@pixel.convex.com>
To: www-talk@nxoc01.cern.ch
Subject: suggested libWWW architecture
Date: Wed, 13 Jan 93 19:11:15 CST
From: Dan Connolly <connolly@pixel.convex.com>


I sent this to Tim a while ago, but I don't think
he's had time to look at it.

Meanwhile, libWWW is becoming reentrant, but I still
think the architecture is kinda clumsy: you have to
have a big data structure describing the DTD, and
a routine for each element, etc.

This doesn't mesh well with the MidasWWW architecture, which
can read the DTD from the X resource database at
runtime.

I have an idea for an architecture that the linemode browser and
MidasWWW could share (along with other new implementations).

It's not radically different from the current libWWW, but
there's a lot of grunt-work between the current libWWW
and what I've got here. But I think the end result would
be much more usable.

We start with the HText class. Instead of the various
style and append methods, we have four methods in a
virtual function table:

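typedef struct{
  /* called for each start tag; returns the content model
     of the element (e.g. EMPTY or MIXED) */
  int (*start_tag) PARAMS((SGML_Object this, CONST char* gi,
			    CONST char** attributes, int nattrs));
  /* called for each end tag */
  VOID (*end_tag) PARAMS((SGML_Object this, CONST char* gi));

  /* called for each entity reference */
  VOID (*entity) PARAMS((SGML_Object this, CONST char* name));

  /* called for each run of character data */
  VOID (*data) PARAMS((SGML_Object this, CONST char* data, int char_qty));
}SGML_DocClass;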

The linemode would declare something like:

SGML_DocClass griddoc = {HText_start_tag, HText_end_tag,
			HText_entity, HText_data};

The HText implementation is responsible for keeping track of
the stack of open elements, if it needs to.
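
To make the open-element bookkeeping concrete, here is a minimal sketch of a document class that keeps its own stack. Everything in it is an assumption for illustration: the names StackDoc, DocClass and stack_*, the fixed nesting limit, and the use of void* in place of SGML_Object. It is not how the current libWWW does it.

```c
#include <assert.h>
#include <string.h>

#define MAX_DEPTH 32  /* assumed limit on element nesting */

/* Hypothetical document object: remembers which elements are open. */
typedef struct {
  const char *open[MAX_DEPTH];  /* generic identifiers of open elements */
  int depth;                    /* number of open elements */
  int chars;                    /* running count of data characters seen */
} StackDoc;

/* Same shape as the four-method table, with void* as the object. */
typedef struct {
  int  (*start_tag)(void *self, const char *gi,
                    const char **attributes, int nattrs);
  void (*end_tag)(void *self, const char *gi);
  void (*entity)(void *self, const char *name);
  void (*data)(void *self, const char *data, int char_qty);
} DocClass;

static int stack_start_tag(void *self, const char *gi,
                           const char **attributes, int nattrs)
{
  StackDoc *doc = (StackDoc *)self;
  (void)attributes; (void)nattrs;
  assert(doc->depth < MAX_DEPTH);
  doc->open[doc->depth++] = gi;  /* push; gi must outlive the element */
  return 0;                      /* placeholder content model code */
}

static void stack_end_tag(void *self, const char *gi)
{
  StackDoc *doc = (StackDoc *)self;
  int i;
  /* Find the innermost open element with this name. */
  for (i = doc->depth - 1; i >= 0; i--) {
    if (strcmp(doc->open[i], gi) == 0) {
      doc->depth = i;  /* closes gi and anything still open inside it */
      return;
    }
  }
  /* No matching open element: ignore the stray end tag. */
}

static void stack_entity(void *self, const char *name)
{
  (void)self; (void)name;  /* entities are ignored in this sketch */
}

static void stack_data(void *self, const char *data, int char_qty)
{
  StackDoc *doc = (StackDoc *)self;
  (void)data;
  doc->chars += char_qty;
}

static const DocClass stackdoc = {stack_start_tag, stack_end_tag,
                                  stack_entity, stack_data};
```

An implementation that doesn't care about nesting (a plain dump, say) could leave the stack out entirely, which is the point of making it the document class's business rather than the parser's.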

On top of these we build some format parsing routines:

SGML_parse(void* dest, void* closure, void* stream, int (getc)(void*));
/* pseudocode:
   int read, content = MIXED;
   char buffer[1000];
   SGML_DocClass *docclass = (SGML_DocClass*)closure;

   while( (read = SGML_read(buffer, content, stream, getc)) != EOF){
     switch(read){
       case SGML_start_tag:
         ... parse name, attributes ...
         content = (docclass->start_tag)(dest, name, attrs, nattrs);
         if(content == EMPTY){
           (docclass->end_tag)(dest, name);
           content = MIXED;   @@ could be ELEMENT
         }
         break;

       case SGML_end_tag:
         ... parse name ...
         (docclass->end_tag)(dest, name);
         content = MIXED;   @@ could be ELEMENT
         break;

       case SGML_entity:
         ... parse name ...
         (docclass->entity)(dest, name);
         break;

       default:
         (docclass->data)(dest, buffer, strlen(buffer));
     }
   }
*/

PlainText_parse(HText* dest, void* closure, void* stream, int (getc)(void*));
/* pseudocode:
   SGML_DocClass *docclass = (SGML_DocClass*)closure;
   (docclass->start_tag)(dest, "HTML", 0, 0);
   (docclass->start_tag)(dest, "BODY", 0, 0);
   (docclass->start_tag)(dest, "PRE", 0, 0);
   Keep a local buffer of about 1000 chars.
   Call (getc)(stream) until EOF.
   Call (docclass->data)(dest, buffer, n) whenever the buffer is full,
   and once more at EOF to flush whatever is left.
   (docclass->end_tag)(dest, "PRE");
   (docclass->end_tag)(dest, "BODY");
   (docclass->end_tag)(dest, "HTML");
*/
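
The plain text case is small enough to write out. What follows is only a sketch under assumptions: DocClass stands in for SGML_DocClass with void* objects, and StringStream, string_getc and the Recorder destination are hypothetical helpers so the example can be run without files or a real HText.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 1000

typedef struct {
  int  (*start_tag)(void *self, const char *gi,
                    const char **attributes, int nattrs);
  void (*end_tag)(void *self, const char *gi);
  void (*entity)(void *self, const char *name);
  void (*data)(void *self, const char *data, int char_qty);
} DocClass;

/* Wrap a character stream in HTML/BODY/PRE and forward it in chunks. */
static void plain_text_parse(void *dest, const DocClass *docclass,
                             void *stream, int (*next_char)(void *))
{
  char buffer[BUF_SIZE];
  int used = 0;
  int c;

  docclass->start_tag(dest, "HTML", NULL, 0);
  docclass->start_tag(dest, "BODY", NULL, 0);
  docclass->start_tag(dest, "PRE", NULL, 0);
  while ((c = next_char(stream)) != EOF) {
    buffer[used++] = (char)c;
    if (used == BUF_SIZE) {      /* buffer full: hand it over */
      docclass->data(dest, buffer, used);
      used = 0;
    }
  }
  if (used > 0)                  /* flush the partial last chunk */
    docclass->data(dest, buffer, used);
  docclass->end_tag(dest, "PRE");
  docclass->end_tag(dest, "BODY");
  docclass->end_tag(dest, "HTML");
}

/* Hypothetical in-memory stream. */
typedef struct { const char *text; size_t pos; } StringStream;

static int string_getc(void *stream)
{
  StringStream *s = (StringStream *)stream;
  if (s->text[s->pos] == '\0') return EOF;
  return (unsigned char)s->text[s->pos++];
}

/* Hypothetical destination that records the calls it receives as text. */
typedef struct { char log[256]; } Recorder;

static void rec_append(Recorder *r, const char *s, size_t n)
{
  size_t len = strlen(r->log);
  size_t room = sizeof r->log - 1 - len;
  if (n > room) n = room;        /* truncate rather than overflow */
  memcpy(r->log + len, s, n);
  r->log[len + n] = '\0';
}

static int rec_start(void *self, const char *gi, const char **a, int n)
{
  Recorder *r = (Recorder *)self;
  (void)a; (void)n;
  rec_append(r, "<", 1); rec_append(r, gi, strlen(gi)); rec_append(r, ">", 1);
  return 0;
}

static void rec_end(void *self, const char *gi)
{
  Recorder *r = (Recorder *)self;
  rec_append(r, "</", 2); rec_append(r, gi, strlen(gi)); rec_append(r, ">", 1);
}

static void rec_entity(void *self, const char *name)
{
  Recorder *r = (Recorder *)self;
  rec_append(r, "&", 1); rec_append(r, name, strlen(name)); rec_append(r, ";", 1);
}

static void rec_data(void *self, const char *data, int char_qty)
{
  rec_append((Recorder *)self, data, (size_t)char_qty);
}

static const DocClass recorder_class = {rec_start, rec_end,
                                        rec_entity, rec_data};
```

Note that the parser never looks at what the destination does with the calls, so the same routine would serve the linemode browser and MidasWWW alike.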

GopherListing_parse(HText* dest, void* closure, void* stream, int (getc)(void*));
/* pseudocode:
   SGML_DocClass *docclass = (SGML_DocClass*)closure;
   (docclass->start_tag)(dest, "HTML", 0, 0);
   (docclass->start_tag)(dest, "BODY", 0, 0);
   (docclass->start_tag)(dest, "MENU", 0, 0);
   while(Gopher_parse_line(stream, getc, type, name, host, port, path)){
      char addr[BIG];
      CONST char* attrs[2];
      sprintf(addr, "gopher://%s:%d/%c%s", host, port, type, path);
      attrs[0] = "HREF"; attrs[1] = addr;
      (docclass->start_tag)(dest, "A", attrs, 1);  (one name/value pair)
      (docclass->data)(dest, name, strlen(name));
      (docclass->end_tag)(dest, "A");
   }
   (docclass->end_tag)(dest, "MENU");
   (docclass->end_tag)(dest, "BODY");
   (docclass->end_tag)(dest, "HTML");
*/
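
The per-line work in the Gopher case could look roughly like this. It's a sketch only: GopherItem, take_field, gopher_parse_line and gopher_item_url are made-up names, it works on a line already read into memory rather than on a stream, and it assumes the usual menu line layout of a type character and display name followed by tab-separated selector, host and port.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical holder for one parsed menu line. */
typedef struct {
  char type;
  char name[128];
  char path[256];
  char host[128];
  int port;
} GopherItem;

/* Copy one field (up to a tab or the end of the line) into out and
   advance *cursor past the tab. Returns 0 if the field does not fit. */
static int take_field(const char **cursor, char *out, size_t out_size)
{
  const char *start = *cursor;
  size_t len = strcspn(start, "\t\r\n");
  if (len >= out_size) return 0;
  memcpy(out, start, len);
  out[len] = '\0';
  *cursor = start + len;
  if (**cursor == '\t') (*cursor)++;
  return 1;
}

/* Parse "<type><name>TAB<selector>TAB<host>TAB<port>".
   Returns 0 for an empty line, the "." terminator, or a malformed line. */
static int gopher_parse_line(const char *line, GopherItem *item)
{
  char port[16];
  const char *cursor = line;

  if (line[0] == '\0' || line[0] == '\r' || line[0] == '\n') return 0;
  if (line[0] == '.' &&
      (line[1] == '\0' || line[1] == '\r' || line[1] == '\n')) return 0;

  item->type = *cursor++;
  if (!take_field(&cursor, item->name, sizeof item->name)) return 0;
  if (!take_field(&cursor, item->path, sizeof item->path)) return 0;
  if (!take_field(&cursor, item->host, sizeof item->host)) return 0;
  if (!take_field(&cursor, port, sizeof port)) return 0;
  if (item->host[0] == '\0' || port[0] == '\0') return 0;
  item->port = atoi(port);
  return 1;
}

/* Build the address used for the HREF attribute.
   Returns 0 if it does not fit in out. */
static int gopher_item_url(const GopherItem *item, char *out, size_t out_size)
{
  int n = snprintf(out, out_size, "gopher://%s:%d/%c%s",
                   item->host, item->port, item->type, item->path);
  return n >= 0 && (size_t)n < out_size;
}
```

Using a bounded snprintf here instead of sprintf into a BIG buffer is my choice for the sketch, not something the design depends on.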


We register each of these with the following routine:

int
ContentType_register(CONST char* type, CONST char* subtype,
		HTParseProc parse, void* closure);

For example:

int
main()
{
  /* the closure is a pointer to the document class */
  ContentType_register("TEXT", "X-HTML", SGML_parse, &griddoc);
  ContentType_register("TEXT", "PLAIN", PlainText_parse, &griddoc);
  ContentType_register("APPLICATION", "X-GOPHER",
			 GopherListing_parse, &griddoc);
  return 0;
}


The following routine can be used for any MIME entity. It will dispatch
the appropriate parsing routine based on the content type header:

int
ContentType_parse(const char* ct, HText* dest, void* stream, int (getc)(void*));
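
One way the registration and dispatch could fit together is a small table searched by type and subtype. This is a sketch, not a proposal for the exact code: the lowercase names, the fixed table size, the ParseProc typedef and demo_parse are all assumptions of mine. It compares case-insensitively (MIME type names are not case-sensitive) and ignores any parameters after the subtype.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define MAX_TYPES 16  /* assumed table size */

typedef int (*ParseProc)(void *dest, void *closure, void *stream,
                         int (*getc_fn)(void *));

typedef struct {
  const char *type;
  const char *subtype;
  ParseProc parse;
  void *closure;
} TypeEntry;

static TypeEntry type_table[MAX_TYPES];
static int type_count = 0;

/* Case-insensitive comparison of a[0..a_len) with the whole of b. */
static int same_token(const char *a, size_t a_len, const char *b)
{
  size_t i;
  if (strlen(b) != a_len) return 0;
  for (i = 0; i < a_len; i++)
    if (toupper((unsigned char)a[i]) != toupper((unsigned char)b[i]))
      return 0;
  return 1;
}

/* Returns 1 on success, 0 if the table is full. */
static int content_type_register(const char *type, const char *subtype,
                                 ParseProc parse, void *closure)
{
  if (type_count == MAX_TYPES) return 0;
  type_table[type_count].type = type;
  type_table[type_count].subtype = subtype;
  type_table[type_count].parse = parse;
  type_table[type_count].closure = closure;
  type_count++;
  return 1;
}

/* Dispatch on a header value such as "text/plain; charset=us-ascii".
   Returns the parser's result, or 0 if no parser is registered. */
static int content_type_parse(const char *ct, void *dest, void *stream,
                              int (*getc_fn)(void *))
{
  const char *slash = strchr(ct, '/');
  size_t type_len, sub_len;
  int i;

  if (slash == NULL) return 0;
  type_len = (size_t)(slash - ct);
  sub_len = strcspn(slash + 1, "; \t");  /* stop before any parameters */
  for (i = 0; i < type_count; i++) {
    if (same_token(ct, type_len, type_table[i].type) &&
        same_token(slash + 1, sub_len, type_table[i].subtype))
      return type_table[i].parse(dest, type_table[i].closure,
                                 stream, getc_fn);
  }
  return 0;
}

/* Hypothetical parser for trying the table out: it just records
   which closure it was handed. */
static int demo_parse(void *dest, void *closure, void *stream,
                      int (*getc_fn)(void *))
{
  (void)stream; (void)getc_fn;
  *(const char **)dest = (const char *)closure;
  return 1;
}
```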


Then we build some load routines, one per access scheme.
(Note that this design separates the format from the access scheme,
which allows us to, for example, load a Gopher menu
from a local file, or load HTML text from a Gopher server.)

/* I don't have error handling worked out yet. We need to have a coherent
   design for this. It's a mess in the current WWWlib. */

/* I think the WWW file: should be split into ftp: and local-file:.
   It's cleaner to implement; there are precedents in the MidasWWW local:
   scheme and the MIME ftp and local-file access-types. */

int
LocalFile_load(HText* dest, CONST char* path, CONST char* search)
{
  FILE* stream;

  /* search is unused for local files */
  if((stream = fopen(path, "r")) != NULL){
    const char* content_type = WWW_zen_content_type_from_extension(path);
    ContentType_parse(content_type, dest, (void*)stream, (int (*)(void*))getc);
    fclose(stream);
    return 1;
  }else{
    /* log an error */
    return 0;
  }
}

int
FTP_load(HText* dest, CONST char* path, CONST char* search);

int
HTTP_load(HText* dest, CONST char* path, CONST char* search);

int
Gopher_load(HText* dest, CONST char* path, CONST char* search)
{
  const char* content_type = Gopher_zen_content_type_from_gtype_char(*path);
  char* host = HTParse(path, PARSE_HOST);
  char* portnum = HTParse(path, PARSE_PORT);
  int port = atoi(portnum);
  static char* tab = "\011";     /* ASCII TAB separates path and search */
  static char* crlf = "\015\012";

  void* stream = TCPOpen(host, port);

  if(stream){
    TCPwrite(stream, path, strlen(path));
    if(search){
      TCPwrite(stream, tab, 1);
      TCPwrite(stream, search, strlen(search));
    }
    TCPwrite(stream, crlf, 2);
    ContentType_parse(content_type, dest, stream, TCPgetc);
    TCPclose(stream);
    return 1;
  }else{
    /* log an error */
    return 0;
  }
}
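
The request that Gopher_load writes is just the selector, an optional tab plus search string, and a CRLF. As a small sketch of that framing on its own, with gopher_request being a hypothetical helper that formats into a caller-supplied buffer instead of writing to a TCP stream:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a Gopher request line: selector, optional TAB + search, CRLF.
   Returns the request length, or -1 if it does not fit in out. */
static int gopher_request(char *out, size_t out_size,
                          const char *selector, const char *search)
{
  int n;

  if (search != NULL)
    n = snprintf(out, out_size, "%s\t%s\r\n", selector, search);
  else
    n = snprintf(out, out_size, "%s\r\n", selector);
  if (n < 0 || (size_t)n >= out_size)
    return -1;
  return n;
}
```

Building the whole line first means one write per request, at the cost of a length limit; the three separate TCPwrite calls avoid the limit.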


Then we register these just like formats:

HTAccess_register(const char* name, HTLoadProc load, void* closure);


And the HTLoadDocument routine in HTAccess.c becomes this:

int
HTAccess_load(HTParentAnchor* p, CONST char* address)
{
  char* scheme = HTParse(address, PARSE_SCHEME);
  /* path is everything after the colon, except the anchor */
  char* path = HTParse(address, PARSE_HOST|PARSE_PORT|PARSE_PATH);
  char* anchor = HTParse(address, PARSE_ANCHOR);
  char* search = HTParse(address, PARSE_SEARCH_TERMS);
  HText* dest = HText_new(p); /* check for doc already loaded in p @@ */
  void* closure;
  HTLoadProc load;
  int ok = 0;

  if(load = /* load routine registered for scheme. find closure too */){
    ok = (load)(dest, path, search, closure);
  }
  HTSelect(dest, anchor);
  return ok;
}
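
To show how the scheme lookup might go, here is a sketch that does its own (very naive) address splitting rather than calling HTParse. The names access_register, access_load, LoadProc, DemoResult and demo_load are assumptions, as is the table size; the splitting only handles the scheme:path?search#anchor shape and nothing cleverer.

```c
#include <assert.h>
#include <string.h>

#define MAX_SCHEMES 8      /* assumed table size */
#define MAX_PATH_LEN 256   /* assumed limit on path plus search */

typedef int (*LoadProc)(void *dest, const char *path,
                        const char *search, void *closure);

typedef struct { const char *name; LoadProc load; void *closure; } SchemeEntry;

static SchemeEntry scheme_table[MAX_SCHEMES];
static int scheme_count = 0;

/* Returns 1 on success, 0 if the table is full. */
static int access_register(const char *name, LoadProc load, void *closure)
{
  if (scheme_count == MAX_SCHEMES) return 0;
  scheme_table[scheme_count].name = name;
  scheme_table[scheme_count].load = load;
  scheme_table[scheme_count].closure = closure;
  scheme_count++;
  return 1;
}

/* Split "scheme:path?search#anchor" and call the loader registered
   for the scheme. Returns the loader's result, or 0 if nothing matches. */
static int access_load(void *dest, const char *address)
{
  const char *colon = strchr(address, ':');
  char path[MAX_PATH_LEN];
  char *search;
  size_t scheme_len, path_len;
  int i;

  if (colon == NULL) return 0;
  scheme_len = (size_t)(colon - address);
  path_len = strcspn(colon + 1, "#");   /* drop the anchor */
  if (path_len >= sizeof path) return 0;
  memcpy(path, colon + 1, path_len);
  path[path_len] = '\0';

  search = strchr(path, '?');           /* split off search terms */
  if (search != NULL) *search++ = '\0';

  for (i = 0; i < scheme_count; i++) {
    const char *name = scheme_table[i].name;
    if (strlen(name) == scheme_len &&
        strncmp(name, address, scheme_len) == 0)
      return scheme_table[i].load(dest, path, search,
                                  scheme_table[i].closure);
  }
  return 0;
}

/* Hypothetical loader for trying the table out: records its arguments. */
typedef struct {
  char path[MAX_PATH_LEN];
  char search[MAX_PATH_LEN];
  const char *via;
} DemoResult;

static int demo_load(void *dest, const char *path,
                     const char *search, void *closure)
{
  DemoResult *r = (DemoResult *)dest;
  strcpy(r->path, path);                 /* fits: path < MAX_PATH_LEN */
  strcpy(r->search, search ? search : "");
  r->via = (const char *)closure;
  return 1;
}
```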


What do you think?

Dan