Re: More on Indexing and Moving one higher than HTML etc

Adrian John Howard (adrianh@cogs.susx.ac.uk)
Fri, 5 Aug 1994 13:10:03 +0200

[Paul Wain (Paul.Wain@brunel.ac.uk) said]
> On going to a higher level of markup, the only specific reply I got on
> this was that maybe we could use some form of word processor and convert down
> to HTML from this. This is fine except that for the amount of information we
> would be looking at this isn't very practical. We have probably in excess of
> 2000 documents that would be going up. We really don't have enough disk space to
> keep 2 copies (one word processor/one HTML) so it would need to be done on the
> fly, but that said think of the poor CPU :) Okay so it's a hypothetical worst
> case, but that's what I am employed to come up with at the moment.

One alternative that immediately springs to mind is that you could have HTML
as your on-line representation, with on-the-fly conversion to whatever format
you are going to use for editing. Slower editing, faster access. Seems a
sensible tradeoff to me.

BTW what are your objections to using HTML as the markup language - is it just
the ease of authoring for the non-CS types? Have you considered using one of
the What-You-See-Is-Something-Like-What-You-Get HTML editors which are
springing up all over the place?

> I talked about wanting to keep certain information in a file that may or may
> not be transparent (author, owner, keywords etc) and the more I think about it
> the more that I can see that people **won't** add this information to the files.
> After all why should they? It won't show up at the page view level so people
> tend not to look at the wider implications. How can we find a way, without
> inventing a submissions system, to make people supply this information?

You've answered your own question - there is no way of _forcing_ people to
add this information without some form of submissions system (short of threats
of physical violence :-)

One fairly easy method to "encourage" people to do it would be to provide
skeleton frameworks for documents with the "author", "keyword", etc fields
filled in with something suitably provocative to ensure the author filled them
out properly (Author:nobody, Keywords:nothing-of-interest... you get the
idea.)
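
A minimal sketch of the skeleton idea, assuming the fields are stored as
<meta name="..." content="..."> elements in the document head. That storage
format, and the names make_skeleton and unfilled_fields, are illustrative
assumptions and not something either of us has settled on. The second function
reports fields that are missing, empty, or still hold the provocative default:

```python
import re

# Placeholder values from the suggestion above; the <meta> storage
# format is an assumption made for this sketch.
PLACEHOLDERS = {"author": "nobody", "keywords": "nothing-of-interest"}

SKELETON = """<html>
<head>
<title>{title}</title>
<meta name="author" content="{author}">
<meta name="keywords" content="{keywords}">
</head>
<body>
<h1>{title}</h1>
</body>
</html>
"""

META_RE = re.compile(
    r'<meta\s+name="([^"]+)"\s+content="([^"]*)"', re.IGNORECASE)


def make_skeleton(title):
    """Return a new document with every metadata field set to its placeholder."""
    return SKELETON.format(title=title, **PLACEHOLDERS)


def unfilled_fields(html_text):
    """Return the sorted names of fields that are missing, empty, or unchanged."""
    found = {name.lower(): value.strip()
             for name, value in META_RE.findall(html_text)}
    return sorted(name for name, default in PLACEHOLDERS.items()
                  if found.get(name, default) in ("", default))
```

A check like this could be run by hand over a directory, or by whatever
submission step ends up existing, to list the documents nobody has filled in.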

Also, I would have thought that in some cases you won't want the author
of the document to be able to alter certain fields (owner for example) - which
leads me to think that you are going to need some form of submission system
anyway. (or are you going to allow access to your server by everybody who is
going to author documents?)

> I'm fairly sure now that we will need to come up with our own indexing system.
> Again this is due to the number of documents we are looking at. It would need
> to be able to run on the files themselves rather than the HTTP output, it would
> need to automatically update the files (so the users don't need to run it when
> they add a file in), and as such it must be able to understand how to arrive at
> the URL for the file. Is this doable? I can't see a way unless I can get around
> the problem in the previous paragraph.

Indexing is very doable.

If you're running a single server, then it should be fairly easy to write
something to scan for the keyword fields in your documents and produce a
suitable HTML file. If you're running on multiple servers, modifying one of the
existing web-robots is probably your simplest choice (if you reject ALIWEB
that is - I still don't quite understand why it's not suitable for you).
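
For the single-server case, something along these lines would do as a starting
point. It is only a sketch under stated assumptions: keywords live in a
<meta name="keywords" content="a, b, c"> element, documents end in .html, and
the server exposes one directory (doc_root) under one URL prefix (base_url).
All of the names, and the www.example.org address in the usage note, are
hypothetical. It works on the files themselves, not the HTTP output, and
derives each URL from the file's path:

```python
import html
import os
import re
from collections import defaultdict

# Assumed metadata format; adjust the pattern to whatever fields are adopted.
KEYWORDS_RE = re.compile(
    r'<meta\s+name="keywords"\s+content="([^"]*)"', re.IGNORECASE)


def path_to_url(path, doc_root, base_url):
    """Map a file under doc_root to its URL, assuming doc_root is served at base_url."""
    rel = os.path.relpath(path, doc_root).replace(os.sep, "/")
    return base_url.rstrip("/") + "/" + rel


def build_index(doc_root, base_url):
    """Return {keyword: sorted list of URLs} for every .html file under doc_root."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(doc_root):
        for name in files:
            if not name.endswith(".html"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="latin-1") as fh:
                match = KEYWORDS_RE.search(fh.read())
            if not match:
                continue  # no keyword field, so nothing to index
            for keyword in match.group(1).split(","):
                keyword = keyword.strip().lower()
                if keyword:
                    index[keyword].add(path_to_url(path, doc_root, base_url))
    return {keyword: sorted(urls) for keyword, urls in index.items()}


def render_index(index):
    """Render the index as a simple HTML definition list."""
    lines = ["<html><head><title>Keyword index</title></head><body>", "<dl>"]
    for keyword in sorted(index):
        lines.append("<dt>%s</dt>" % html.escape(keyword))
        for url in index[keyword]:
            lines.append('<dd><a href="%s">%s</a></dd>'
                         % (html.escape(url, quote=True), html.escape(url)))
    lines += ["</dl>", "</body></html>"]
    return "\n".join(lines)
```

Usage would be roughly render_index(build_index("/usr/local/www",
"http://www.example.org/")), with the result written to a file the server can
see. Documents without a keyword field are silently skipped, which is one more
reason to get people to fill the fields in.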

As previously mentioned I think you are going to have to implement some form
of submissions system anyway - in which case you could tack index building
onto the end of that. Otherwise, just rebuild it once a day or something.

> Also there were very few ideas on how to track author and ownership. Does this
> mean that no one has looked at this issue?

We have a fairly ad-hoc system here and just use the existing Unix user/group
permissions on the W3 server's file-system to track ownership and control
access (along with encouraging people to add an <address> at the end of
the document.) [BTW Has anybody ever tried using an existing document control
system like RCS to keep track of HTML files?] This of course becomes a lot
more complex (or falls down completely) when you have more than one server, or
information being mirrored on different servers.
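
To illustrate the single-server case, here is a small sketch of reading
ownership straight off the file-system. It is not the script we run here, just
an example of the approach, and the function name ownership is made up for
illustration. It relies on the pwd and grp modules, so it is Unix-only:

```python
import grp
import os
import pwd
import stat


def ownership(path):
    """Return the owner, group and permission string the file-system records for path."""
    st = os.stat(path)
    try:
        owner = pwd.getpwuid(st.st_uid).pw_name
    except KeyError:
        owner = str(st.st_uid)  # uid with no passwd entry
    try:
        group = grp.getgrgid(st.st_gid).gr_name
    except KeyError:
        group = str(st.st_gid)  # gid with no group entry
    return {
        "owner": owner,
        "group": group,
        "mode": stat.filemode(st.st_mode),
        "group_writable": bool(st.st_mode & stat.S_IWGRP),
    }
```

This only tells you who owns the copy on one machine, which is exactly why it
stops being useful once documents are mirrored.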

Oh well, back to work.

Adrian

aids (adrianh@cogs.susx.ac.uk) ObDisclaimer: Poplog used to pay my wages
Phone: +44 (0)273 678367 URL: http://www.cogs.susx.ac.uk/users/adrianh/