Re: WWW meta indexes (proposal)

Martijn Koster <m.koster@nexor.co.uk>
Message-id: <9310260944.AA15707@dxmint.cern.ch>
To: sanders@bsdi.com
Cc: www-talk@nxoc01.cern.ch
Subject: Re: WWW meta indexes (proposal)
In-reply-to: Your message of "Mon, 25 Oct 1993 12:49:49 EST." <9310251749.AA01445@austin.BSDI.COM>
Date: Tue, 26 Oct 1993 09:43:54 +0000
From: Martijn Koster <m.koster@nexor.co.uk>

> 
> WWW Indexing
> ============
> 
>   Ok folks.  It's time we got busy and made this better.

Hehe... I was just planning to start doing some perling on Thursday night
based on the indexing discussion I started on c.i.www before going on
holiday :-) Did you have a look at
	http://www.nexor.co.uk/mak/doc/summarising/proposal.html ?

>  Let's hash out some of the issues and then do it.

Yes. You got my full support there :-)

 
> What we need to accomplish
> --------------------------
> 1) agree on the filename of the site.idx file

/site.idx is fine for me.

> 2) agree on the format of the file (either "foo: data" or something else).

I vote yes for foo: data (with RFC-822-like continuation lines).

> 3) agree on the initial content and semantics of the index file

In your site.idx you have single index lines that are of the form
Index: <url> <description>. I have some comments about that:

To allow nice presentation of the search results I'd like some more info
here: URL, title and description at least. Somebody suggested some
indication of the level of the document, like "Service", "document", "mention".
I think this might be useful; I for one will happily run a search engine
on a resulting database of services, whereas a database of all
documents is too big for my liking.
If you allow HTML in the description (which again would be nice for
presentation) it might be useful to have explicit keywords as well
(to prevent people searching for HREF picking up references in the
description, for example).
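
As a rough illustration of why explicit keywords help, here is a minimal
Python sketch. The record layout, the "keywords" field name and the
comma-separated keyword syntax are all assumptions made for this example;
nothing of the sort has been agreed on. The point is only that a search
restricted to the keywords never sees the markup in the description.

```python
def matches(record, term):
    """Return True if term is one of the record's explicit keywords.

    Only the (hypothetical) comma-separated "keywords" field is consulted,
    so HTML markup inside the description can never produce a hit.
    """
    raw = record.get("keywords", "")
    keywords = [k.strip().lower() for k in raw.split(",") if k.strip()]
    return term.lower() in keywords


# A made-up record whose description contains an HTML anchor.
record = {
    "url": "http://server/guide.html",
    "description": 'See the <A HREF="http://server/faq.html">FAQ</A>.',
    "keywords": "guide, faq",
}

print(matches(record, "faq"))   # the keyword is listed explicitly
print(matches(record, "href"))  # appears only in the description markup
```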

If you start using all these different fields per URL it makes more sense
to split them up into records with separate "foo: data" fields, and 
separated by blank lines. As the URLs are the main reason we want to do
indexing I don't think that is unreasonable.
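
To make the record idea concrete, here is a minimal Python sketch of a
reader for such a file: records separated by blank lines, one "foo: data"
field per line, and RFC-822-like continuation lines that start with
whitespace. The field names (URL, Title, Level, Description) and the sample
entries are invented for illustration only; the actual field set is still
to be decided.

```python
def parse_records(text):
    """Parse blank-line-separated records of "Field: data" lines.

    A line that starts with a space or tab continues the previous field.
    Field names are lower-cased; lines without a colon are skipped.
    """
    records = []
    current = {}
    last = None  # name of the most recently seen field, for continuations
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current record, if there is one.
            if current:
                records.append(current)
            current, last = {}, None
        elif line[0] in " \t" and last is not None:
            # Continuation line: append to the previous field's value.
            current[last] += " " + line.strip()
        elif ":" in line:
            # Split on the first colon only, so URLs keep their own colons.
            name, _, value = line.partition(":")
            last = name.strip().lower()
            current[last] = value.strip()
    if current:
        records.append(current)
    return records


# Made-up sample data; the URLs and field names are placeholders.
SAMPLE = (
    "URL: http://server/docs/intro.html\n"
    "Title: Introduction\n"
    "Level: document\n"
    "Description: A made-up entry whose description\n"
    "    continues on a second line.\n"
    "\n"
    "URL: http://server/search\n"
    "Title: Search\n"
    "Level: Service\n"
)

for rec in parse_records(SAMPLE):
    print(rec["url"], "-", rec["title"])
```

Because unknown fields simply end up as extra keys, a reader written this
way would not need to change when new fields are added later.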

However, if the consensus is that this makes creating and maintaining
the index too much effort then I say go for the single line (it's easier
to grep through too :-).
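
For comparison, the single-line form is about as easy to read
mechanically. A minimal Python sketch, assuming the angle brackets in
"Index: <url> <description>" are placeholders rather than literal
characters and that the URL contains no whitespace (both are my
assumptions, and the sample line is made up):

```python
def parse_index_line(line):
    """Split an "Index: url description" line into (url, description).

    Returns None for lines that are not Index lines or carry no URL.
    """
    name, sep, rest = line.partition(":")
    if not sep or name.strip().lower() != "index":
        return None
    parts = rest.split(None, 1)  # the URL is the first whitespace-free token
    if not parts:
        return None
    url = parts[0]
    description = parts[1].strip() if len(parts) > 1 else ""
    return url, description


print(parse_index_line("Index: http://server/docs/ Local documentation"))
```

The price of this simplicity is that a title, level or keywords would have
to be squeezed into the free-text description.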

> 4) setup an email address where people can send registration forms
>    (these don't have to be processed right away, yet).

I suggest this is actually a distribution list (that I'd like to
be on :-). That way we can have multiple servers on the net.
Unfortunately I am not in a position where I can offer to set up or
maintain such a list (oh, I'd love a WWW job... :-)

> Constraints
> -----------
> 1) The data format must be extensible (need I even say it)

I think this is an argument for a record-based url description;
you can add fields if / when you like. 

> 2) It must be simple enough that we can get started soon

Yes

> 3) It must allow for meta-indexing other protocols in the future

Isn't that implicit in using UR?s

> 4) the database must be distributed (so you can do the search on
>    a nearby site).

I think the only quick way we are going to decide on that is to run multiple
identical servers (like Archie). If we start to discuss distributed protocols
like X.500/whois++ I think we'll be here for a while.

> What we will need next
> ----------------------
> 1) software to accept and process registration forms (via email)
> 2) software for updating registration (a robot)
> 3) software for building the indexes (wais?)
> 4) software for searching the index and a site to host it
> 
>   I believe that the above is all fairly easy.

Should be.

>   To get the process of a WWW global index started I would like to propose
>   the following for a site registration file format.  This data should
>   be accessible on your server as http://server/site.idx

Sure.

>   I believe this covers the basics and sufficiently allows for future
>   extension.

Summary: Yes, let's sort out the indexing. I like the host-specific 
information. I don't think the single "Index" lines allow for
future extension.

-- Martijn

__________
Internet: m.koster@nexor.co.uk
X-400: C=GB; A=Mark400; P=Nexor; O=Nexor; S=koster; I=M
X-500: c=GB@o=NEXOR@cn=Martijn Koster
WWW: http://web.nexor.co.uk/mak/mak.html