Re: Links that refer to a range of text, not just a point.

Dan Connolly <connolly@pixel.convex.com>
Message-id: <9206241715.AA10088@pixel.convex.com>
To: davis@willow.tc.cornell.edu (Jim Davis)
Cc: www-talk@nxoc01.cern.ch
Subject: Re: Links that refer to a range of text, not just a point. 
In-reply-to: Your message of "Wed, 24 Jun 92 12:00:50 EDT."
             <9206241600.AA13167@willow.tc.cornell.edu> 
Content-Type: multipart/mixed; boundary="8<--"
Mime-Version: 1.0
Date: Wed, 24 Jun 92 12:15:52 CDT
From: Dan Connolly <connolly@pixel.convex.com>
--8<--

>But then this raises another issue: does WWW allow anchors within
>anchors?  I think not - in which case I could not use WWW anchors to
>both label a paragraph (e.g. for attaching an annotation) and a word
>within it (e.g. for definition).  This worries me quite a bit.  Nor
>can I attach multiple links to the same point (e.g. definitions of a
>word in multiple languages).
>
This and other related questions (can I have lists within lists?)
are precisely the reason for using a well-defined structural markup
language governed by SGML processing rules.

Right now we have no DTD for HTML, and the only answers lie in
the browser source code. The documentation "in the web" is too
vague. But I hardly think we want the browser source code to
be the definition of HTML.

On the other hand, I tried and failed to come up with a DTD which
described HTML in such a way that the existing documents are legal.

Before this situation gets out of hand, we need to establish a
(possibly evolving, but at least existing) SGML DTD for HTML.
This will require incompatible changes to the definition of HTML
(for instance, the PLAINTEXT, XMP, and LISTING features of HTML
don't quite fit into SGML).

Enclosed* are a proposed DTD, a perl hack to patch existing files,
and a sample patched file. I invite:

= SGML experts to round out the DTD (should we include
  stuff from the ISO General DTD? the AAP article DTD?
  ISOnum, ISOpub etc. standard entities? How about
  using the QWERTZ LaTeX-like DTD?)

= SGML non-experts to become conversant in SGML (it's
  coming whether you like it or not)

= HyTime experts to add to the confusion. Seriously, I'd
  like to know what HyTime has to offer.

= DSSSSSSSSSSL experts to do the same

= the WWW team to adapt existing code to match this DTD
  (or some real DTD)

= HTTP server sites to then update their HTML files


	*enclosed in the MIME multipart/mixed enclosure sense

--8<--

<!-- This DTD was produced by DeveGram on Tue Jun  2 18:58:16 1992 -->
<!-- and hand-edited by connolly@convex.com -->

<!-- typical usage:

  <!DOCTYPE web-node SYSTEM 
    [
    <!ENTITY UDI011 SDATA
      "http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html">
    ]>
 -->

<!--     Parameter Entities       -->

<!--      Terminal symbols        -->

<!ENTITY % words "#PCDATA" >

<!--    Non-ELEMENT symbols       -->

<!ENTITY % inline       "%words | A" >
<!ENTITY % text         "%inline | P | IMAGE" >
<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >

<!ENTITY lt "<">
<!ENTITY gt ">">
<!ENTITY amp "&">

<!ENTITY lt. "<">
<!ENTITY gt. ">">
<!ENTITY amp. "&">

<!--     Document structure       -->

<!ELEMENT WEB-NODE      O O  (TITLE, NEXTID?, ISINDEX?, section+, ADDRESS?)>

<!ELEMENT TITLE - -  (%inline)+>
<!ELEMENT ADDRESS - - (%text)+>

<!ELEMENT NEXTID - O EMPTY >
<!ATTLIST NEXTID N NUMBER #IMPLIED>

<!ELEMENT ISINDEX - O EMPTY >


<!ELEMENT section O O ((%heading)?,
                        (
                        %text |
                        section |
                        MENU |
                        UL |
                        OL |
                        DIR |
                        DL)+)>

<!ELEMENT (H1|H2|H3|H4|H5|H6)   - -  (%inline) >

<!ELEMENT P     - O  EMPTY -- paragraph SEPARATOR -->

<!ELEMENT IMAGE - O EMPTY>
<!ATTLIST IMAGE DATA ENTITY #REQUIRED>

<!ELEMENT A     - -  (%inline)+>
<!ATTLIST A
        NAME CDATA #IMPLIED
        HREF ENTITY #IMPLIED
        TYPE CDATA #IMPLIED --@@-- >

<!ELEMENT MENU  - -  (LI+)>

<!ELEMENT UL    - -  (LI+)>

<!ELEMENT OL    - -  (LI+)>

<!ELEMENT DIR   - -  (LI+)>

<!ELEMENT LI    - O  (%text)+>

<!ELEMENT DL    - -  ((DT, DD)+)>

<!ELEMENT DT    - O  (%inline)+>

<!ELEMENT DD    - O  (%text)+>

--8<--
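
The nesting questions raised at the top (anchors within anchors, lists
within lists) can be read off the content models in the DTD above. Here is
a minimal Python sketch, not part of the original enclosures, that expands
the parameter entities and lists which elements may appear directly inside
another. The function names are made up for illustration, the regular
expressions only cover the declaration style used in this DTD, and the
check ignores ordering and occurrence indicators.

```python
import re

# Declarations copied from the DTD above (only the ones needed here).
DTD = """
<!ENTITY % words "#PCDATA" >
<!ENTITY % inline       "%words | A" >
<!ENTITY % text         "%inline | P | IMAGE" >
<!ELEMENT A     - -  (%inline)+>
<!ELEMENT UL    - -  (LI+)>
<!ELEMENT LI    - O  (%text)+>
"""

def parameter_entities(dtd):
    """Map each parameter entity name to its replacement text."""
    return dict(re.findall(r'<!ENTITY\s+%\s+(\w+)\s+"([^"]*)"', dtd))

def expand(model, entities):
    """Substitute %name references until no known reference remains."""
    reference = re.compile(r'%(\w+);?')
    while True:
        expanded = reference.sub(
            lambda m: entities.get(m.group(1), m.group(0)), model)
        if expanded == model:
            return model
        model = expanded

def allowed_children(dtd):
    """Map each element name to the set of tokens in its content model."""
    entities = parameter_entities(dtd)
    children = {}
    declarations = re.findall(
        r'<!ELEMENT\s+(\w+)\s+[-O]\s+[-O]\s+(.*?)>', dtd)
    for name, model in declarations:
        tokens = re.findall(r'#?[\w-]+', expand(model, entities))
        children[name.upper()] = {token.upper() for token in tokens}
    return children

children = allowed_children(DTD)
print("A inside A: ", "A" in children["A"])
print("UL inside LI:", "UL" in children["LI"])
```

Under these declarations the sketch finds that A may directly contain A,
and that LI may not directly contain UL, so this draft of the DTD allows
nested anchors but not nested lists.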

And here's a perl script that attempts to patch up existing
HTML files:

--8<--

#!/usr/local/bin/perl
#
# USE
#   fix-html.pl <W3-file.html >W3-file.sgml
#
# SEE ALSO
#   the web-node.dtd.
#

print "<!DOCTYPE WEB-NODE SYSTEM \n[\n";

@html = <>;                     # read whole file
$_ = join('', @html);
$out = '';

$header = 0;
$anchor = "UDI000";
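# Walk the input one "<" at a time: copy the text before each tag,
# then rewrite the tags we care about (matched case-insensitively).
while(/</){
    $out .= $`;
    $_ = $';
    if(s/^A\s+//i){
        &fix_anchor;
    }elsif(s/^NEXTID\s+(\d+)\s*>//i){
        $out .= "<NEXTID N=$1>";
    }elsif(s/^H(\d)>//i){
        local($n) = $1;
        # close sections at this level or deeper, then open down to level $n
        while($n<=$header){ $out .= "</SECTION>"; $header--; }
        while($n>$header){ $out .= "<SECTION>"; $header++; }
        $out .= "<H$n>";
    }else{
        $out .= '<';
    }
}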

$out .= $_;

foreach(keys %anchor){
    local($ent) = $anchor{$_};

    print "<!ENTITY $ent SDATA \"$_\">\n";
}

print "]>\n", $out;

sub fix_anchor{
    local($name, $href, $type);

    # What exactly is the syntax of an SGML attribute value?
    while(s/^(\w+)\s*=\s*((\"[^\"]*\")|([^\s>]+))\s*//){
        local($a) = $1;
        local($v) = ($3 || $4);
        # a quoted value is captured with its quotes; drop them so the
        # output is not quoted twice
        $v =~ s/^\"([^\"]*)\"$/$1/;
        $href = $v if $a =~ /^href$/i;
        $name = $v if $a =~ /^name$/i;
        $type = $v if $a =~ /^type$/i;
    }
    s/[^>]*>//;

    $out .= "<A";
    $out .= " NAME=\"$name\"" if $name ne '';
    $out .= " TYPE=\"$type\"" if $type ne '';
    if($href ne ''){
        if(!defined($anchor{$href})){
            $anchor{$href} = ++$anchor;
        }
        $out .= " HREF=" . $anchor{$href};
    }
    $out .= ">";
}

--8<--
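
The least obvious part of the script is how it turns a flat run of
headings into nested sections. Here is a small Python sketch of just that
step, not taken from the script itself: the function name and the list
representation are assumptions for illustration. Like the script, it never
closes the sections still open at the end of input, which the DTD permits
because the section end tag is omissible.

```python
def section_tags(levels):
    """Return the tags emitted for a sequence of heading levels."""
    out = []
    depth = 0  # number of SECTION elements currently open
    for n in levels:
        # Close every section at this level or deeper.
        while n <= depth:
            out.append("</SECTION>")
            depth -= 1
        # Open sections until we are nested n levels deep.
        while n > depth:
            out.append("<SECTION>")
            depth += 1
        out.append(f"<H{n}>")
    return out

print(section_tags([1, 2]))
print(section_tags([1, 2, 1]))
```

A heading that skips a level (H1 followed directly by H3) opens two
sections at once, so the second of them starts without a heading of its
own. The DTD allows this because the heading in a section is optional.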

Here's my default.html run through the above script:

--8<--

<!DOCTYPE WEB-NODE SYSTEM 
[
<!ENTITY UDI011 SDATA
"http://info.cern.ch/hypertext/DataSources/NewsFromVM/Overview.html">
<!ENTITY UDI006 SDATA "http://crnvmc.cern.ch./FIND">
<!ENTITY UDI020 SDATA "http://info.cern.ch/rpc/doc/User/UserGuide.html">
<!ENTITY UDI013 SDATA
"http://info.cern.ch/hypertext/DataSources/bySubject/Overview.html">
<!ENTITY UDI021 SDATA "http://otax.tky.hut.fi/tky/default.html">
<!ENTITY UDI017 SDATA
"http://info.cern.ch:8001/archive.orst.edu:9000/archie-orst.edu">
<!ENTITY UDI010 SDATA "http://crnvmc.cern.ch/NEWS/student">
<!ENTITY UDI019 SDATA
"http://info.cern.ch./hypertext/Products/WAIS/Sources/Overview.html">
<!ENTITY UDI002 SDATA "http://info.cern.ch/hypertext/WWW/TheProject.html">
<!ENTITY UDI007 SDATA "http://crnvmc.cern.ch/NEWS/?">
<!ENTITY UDI001 SDATA "QuickGuide.html">
<!ENTITY UDI012 SDATA
"http://info.cern.ch/hypertext/DataSources/News/Overview.html">
<!ENTITY UDI003 SDATA "http://crnvmc.cern.ch./WHO">
<!ENTITY UDI022 SDATA
"gopher://gopher.micro.umn.edu:70/11/Other%20Gopher%20and%20Information%20Servers">
<!ENTITY UDI005 SDATA "http://crnvmc.cern.ch./FIND/jaune?">
<!ENTITY UDI004 SDATA "http://crnvmc.cern.ch./FIND/yellow?">
<!ENTITY UDI016 SDATA "http://crnvmc.cern.ch/FIND/DESY?">
<!ENTITY UDI009 SDATA "http://crnvmc.cern.ch./NEWS/vmnews">
<!ENTITY UDI008 SDATA "http://crnvmc.cern.ch./NEWS/cern">
<!ENTITY UDI023 SDATA
"http://info.cern.ch./hypertext/WWW/LineMode/Defaults/default.html">
<!ENTITY UDI015 SDATA "http://slacvm.slac.stanford.edu./FIND/spires">
<!ENTITY UDI018 SDATA "http://iicm.tu-graz.ac.at./jargon">
<!ENTITY UDI014 SDATA "http://info.cern.ch./hypertext/DataSources/Overview.html">
]>
<TITLE>CERN Information</TITLE>
<NEXTID N=10>
<SECTION><H1>CERN Information - Select by number</H1>
<DL>
<DT><A NAME="0" HREF=UDI001>Help</A>
<DD>On this program, or the
<A HREF=UDI002>World-Wide Web project</A>.
<DT><A NAME="2" HREF=UDI003>Phone book</A>
<DD>People, phone numbers, accounts and email addresses.
See also the analytical
<A NAME="yellow" HREF=UDI004>Yellow Pages</A>, or
the same index in French :
<A NAME="jaune" HREF=UDI005>Pages Jaunes</A>.
<DT><A NAME="1" HREF=UDI006>"XFIND" index</A>
<DD>Index of computer centre documentation, newsletters, news,
help files, etc...
<DT><A NAME="groups" HREF=UDI007>News</A>
<DD>A complete list of all public CERN news groups, such as
<A NAME="3" HREF=UDI008>news from the CERN User's
Office</A>,<A NAME="4" HREF=UDI009>
CERN computer center news</A>,<A HREF=UDI010>
student news</A>. See also <A NAME="5" HREF=UDI011>private
groups</A> and <A NAME="inews" HREF=UDI012>Internet
news</A>.
</dl>
<SECTION><H2>From other sites</h2>
See online data by
<A NAME="subject" HREF=UDI013>subject</A>,
pointers to
<A HREF=UDI014>other forms of online data</a>, and the following specific databases:
<DL>
<DT><A NAME="spires" HREF=UDI015>SLAC SPIRES</A>
<DD>The High Energy Physics preprint index at Stanford Linear
Accelerator, California.
(This is the same information available via the QSPIRES facility on BITNET.
Include the word "FIND" as the first keyword, eg: K FIND AUTHOR FRED.).
<DT><A NAME="desy" HREF=UDI016>DESY documents</a>
<DD>Documents and help files from the DESY lab in Hamburg.
<DT><A NAME="archie" HREF=UDI017>
Archie</a>
<DD>An index of almost everything available by "anonymous FTP".
<DT><A NAME="7" HREF=UDI018>Hacker Jargon</a>
<DD>An index to a cross-referenced set of hacker terms. A demonstration
of the WWW gateway to the Graz Technical University Hyper-G database.
<DT><A NAME="9" HREF=UDI019>W.A.I.S.</a>
<DD>All kinds of information available from "Wide Area Information Servers".
<DT><A NAME="6" HREF=UDI020>CERN RPC</A>
<DD>The user guide for the RPC system developed in CERN CN division
(not Sun/RPC). This is an example of documentation (partially) converted
into hypertext.
<DT><A NAME="hut" HREF=UDI021>Helsinki</a>
<DD>Helsinki Technical University information service (Mostly Finnish).
<DT><A NAME="gopher" HREF=UDI022>Gophers</a>
<DD>Campus-wide information systems using "Gopher" software. (Requires
www version 1.1 or higher)
</DL>
(This page may be an out of date copy. See the
<A NAME="latest" HREF=UDI023>latest version</a>.)

--8<----
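
Since HREF is declared as an ENTITY attribute, every anchor in a patched
file has to name an entity declared in the DOCTYPE subset. A rough Python
check for that, offered as a sketch only: the function name is made up,
and the regular expressions assume the layout the script produces (SDATA
entities, unquoted HREF values) and are no substitute for a real SGML
parser.

```python
import re

def undeclared_hrefs(document):
    """Return HREF entity names that are used but never declared."""
    declared = set(re.findall(r'<!ENTITY\s+(\w+)\s+SDATA', document))
    used = set(re.findall(r'<A\b[^>]*?\bHREF=(\w+)', document,
                          re.IGNORECASE))
    return used - declared

# A short excerpt of the patched file above.
sample = '''<!DOCTYPE WEB-NODE SYSTEM
[
<!ENTITY UDI001 SDATA "QuickGuide.html">
<!ENTITY UDI002 SDATA "http://info.cern.ch/hypertext/WWW/TheProject.html">
]>
<TITLE>CERN Information</TITLE>
<DT><A NAME="0" HREF=UDI001>Help</A>
<DD>On this program, or the
<A HREF=UDI002>World-Wide Web project</A>.
'''

print(undeclared_hrefs(sample))
```

An empty result means every anchor in the text points at a declared
entity.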