WWW news indexing gateway

mcharity@hq.lcs.mit.edu (Mitchell N Charity)
Date: Fri, 8 Oct 93 22:49:07 EDT
From: mcharity@hq.lcs.mit.edu (Mitchell N Charity)
Message-id: <9310090249.AA04582@hq.lcs.mit.edu>
To: www-talk@nxoc01.cern.ch
Subject: WWW news indexing gateway
Reply-To: mcharity@lcs.mit.edu
X-Phone: NE43-512:(617)253-6023  fax:258-8682  home:497-1506

In the two-hour hack category...

There is a usenet news indexer, "ni", created by Mike Burrows of DEC.
The client which comes with the distribution was never really intended
to be used by users, and it shows.  So, why not do a WWW gateway?

A WWW-ni gateway should (1) provide a friendly query language, and (2)
process results for readability.  General query mangling was too
difficult for the 2hr hack category, and I didn't even get to the
several easy special cases which would make a big difference.
Doing presentation was simple.
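
A minimal sketch of what one such easy special case might look like.
It assumes only the two fielded forms visible in the example result
further down, subj(...) and g(...), joined by "&".  The friendly
"subject:" and "group:" prefixes and the name friendly_to_ni are
hypothetical, and anything not recognized is passed to ni unchanged
rather than guessed at.

```perl
use strict;
use warnings;

# Hypothetical mapping from friendly field names to ni's fielded forms.
# Only subj() and g() are taken from the example result; nothing else
# about ni's query language is assumed here.
my %field = (subject => 'subj', group => 'g');

sub friendly_to_ni {
    my ($input) = @_;
    my @terms;
    for my $word (split ' ', $input) {
        my ($name, $value) = $word =~ /^(\w+):(.+)$/;
        if (defined $name && exists $field{lc $name}) {
            push @terms, $field{lc $name} . "($value)";
        } else {
            push @terms, $word;    # unknown syntax: hand to ni as-is
        }
    }
    # Assumption: "&" combines terms, as in the example query.
    return join '&', @terms;
}

print friendly_to_ni("subject:cite group:www"), "\n";
# prints: subj(cite)&g(www)
```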

So, the result, unfortunately not globally accessible, is an index
page which describes the bare ni client's somewhat painful query
language, and which hands off queries to the perl script abstracted
below.  An example result follows.

"ni" is available from gatekeeper.dec.com.  It supports fielded, full
text search.  The current(?) version does _not_ do word adjacency.  A
500MB news spool is said to require 200MB of index, 45MB memory, and
to scale linearly.
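
As a rough worked example of those figures, assuming "scale linearly"
means index and memory grow in direct proportion to spool size (the
function name ni_estimate is made up for illustration):

```perl
use strict;
use warnings;

# Reported figures: a 500MB spool needs 200MB of index and 45MB of memory.
my ($ref_spool, $ref_index, $ref_mem) = (500, 200, 45);

# Assumption: linear scaling means simple proportionality to spool size.
sub ni_estimate {
    my ($spool_mb) = @_;
    my $ratio = $spool_mb / $ref_spool;
    return ($ref_index * $ratio, $ref_mem * $ratio);
}

my ($index_mb, $mem_mb) = ni_estimate(2000);
printf "2000MB spool: about %dMB index, %dMB memory\n", $index_mb, $mem_mb;
# prints: 2000MB spool: about 800MB index, 180MB memory
```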


-----[presentation code]-----
print "<title>ni</title><isindex><ul>\n";

open(IN,"-|") || exec $ni,"c $query";
chop($c = <IN>);

&fail if($c !~ /^[0-9]+$/o);
print "<b>$query <i>matched</i> $c <i>articles at $d</i></b><p>\n\n";
# Close the list opened above before leaving when nothing matched.
if ($c == 0) { print "</ul>\n"; exit(0); }

open(IN,"-|") || exec $ni,"h $query";

for($i=0; $i<=$limit_n && $i <$c; $i++) {
  $l = &next_paragraph;
  $l =~ /Message-Id: <([^>]+)>/oi || &fail; $id = $1;
  $l =~ /Newsgroups: +(.+)/oi || &fail;     $gr = $1;
  $l =~ /Subject: +(.+)/oi || &fail;        $su = $1;
  $l =~ /From: +(.+)/oi || &fail;           $fr = $1;
  $l =~ /Date: +(.+)/oi || &fail;           $da = $1;
  $fr =~ s/.*\(([^\)]+)\).*/$1/;
  $gr =~ s/,([^,]+)/,<a href=\"news:$1\">$1<\/a>/g;
  $gr =~ s/^([^,]+)/<a href=\"news:$1\">$1<\/a>/;
  print "<li>[$gr] <a href=\"news:$id\"><b>$su</b> ($da) - $fr</a>\n";
}
print "</ul>\n";
----[end of code]----
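
The next_paragraph helper is not shown in the abstract above.  A
minimal sketch of how it might look, assuming the "h" output is a
series of header blocks separated by blank lines; the filehandle
argument, the html_escape routine, and the sample text are all
illustrative additions, not part of the original script.  Escaping is
included because header text can contain "<" or "&", which would
otherwise be copied into the page as markup.

```perl
use strict;
use warnings;

# Hypothetical stand-in for next_paragraph: read lines from a filehandle
# up to the next blank line and return them as one string.
sub next_paragraph {
    my ($fh) = @_;
    my $para = '';
    while (defined(my $line = <$fh>)) {
        if ($line =~ /^\s*$/) {
            last if $para ne '';   # blank line ends the paragraph
            next;                  # skip blank lines before it starts
        }
        $para .= $line;
    }
    return $para;
}

# Escape the characters that are special in HTML.
sub html_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    return $s;
}

# Made-up sample standing in for the header output of the ni client.
my $sample = "Subject: A <b> & C\nFrom: x\@example.com (Some One)\n\n"
           . "Subject: second\n";
open(my $in, '<', \$sample) or die "cannot open sample: $!";

my $p = next_paragraph($in);
my ($su) = $p =~ /^Subject: +(.+)/mi;
print html_escape($su), "\n";
# prints: A &lt;b&gt; &amp; C
```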
----[example result (cleaned up fragment)]----
<b>subj(cite)&g(www) <i>matched</i> 8 <i>articles at 08 Oct 93 (22:21)</i></b><p>

<li>[<a href="news:comp.infosystems.www">comp.infosystems.www</a>,
     <a href="news:alt.hypertext">alt.hypertext</a>]
   <a href="news:28d0nh$ds2@bradley.bradley.edu">
   <b>How to cite a web document</b> (29 sep 1993) - Jerry Whelan</a>
<li>[<a href="news:comp.infosystems.www">comp.infosystems.www</a>,
     <a href="news:alt.hypertext">alt.hypertext</a>]
   <a href="news:MARCA.93Sep29190059@wintermute.ncsa.uiuc.edu">
   <b>Re: How to cite a web document</b> (29 sep 1993) - Marc Andreessen</a>
<li>[<a href="news:comp.infosystems.www">comp.infosystems.www</a>,
     <a href="news:alt.hypertext">alt.hypertext</a>]
   <a href="news:28d82j$kdq@bradley.bradley.edu">
   <b>Re: How to cite a web document</b> (30 sep 1993) - Jerry Whelan</a>
----[end of example]----