HTML-PS converter

Fred E Potts
Fri, 16 Dec 1994 18:27:57 +0100


I used the HaLsoft Validation service and just fed the URL into it,
which was the way I figured you would want to have it parsed.

When I changed the <!DOCTYPE to:


it parsed okay. <gag!> Live and learn. This presents a really
interesting problem, and it looks as though a bit of work needs to be
done.

It seems the following is how DOCTYPE is currently being used for 2.0:

PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0//EN" html.dtd
PUBLIC "-//IETF//DTD HTML Level 2//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN" html.dtd
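
To make the lookup concrete, here is a minimal Python sketch of how a
validator might resolve a DOCTYPE against entries like the four above.
It assumes the entries are catalog-style lines of the form
PUBLIC "public identifier" file. The names parse_catalog and
resolve_doctype are hypothetical, and this is not how sgmls or the
HaLsoft service is actually implemented. A public identifier that is
missing from the table resolves to nothing, which is the situation
behind a "Could not find external document type" error.

```python
import re

# Catalog-style entries copied from the list above (assumed format:
# PUBLIC "<public identifier>" <DTD file>).
CATALOG_TEXT = '''\
PUBLIC "-//IETF//DTD HTML//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0//EN" html.dtd
PUBLIC "-//IETF//DTD HTML Level 2//EN" html.dtd
PUBLIC "-//IETF//DTD HTML 2.0 Level 2//EN" html.dtd
'''

ENTRY_RE = re.compile(r'^\s*PUBLIC\s+"([^"]+)"\s+(\S+)\s*$')
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+HTML\s+PUBLIC\s+"([^"]+)"', re.IGNORECASE)


def parse_catalog(text):
    """Return a dict mapping each public identifier to its DTD file."""
    catalog = {}
    for line in text.splitlines():
        match = ENTRY_RE.match(line)
        if match:
            catalog[match.group(1)] = match.group(2)
    return catalog


def resolve_doctype(document, catalog):
    """Return the DTD file for the document's DOCTYPE, or None if the
    document has no DOCTYPE or its public identifier is not listed."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None
    return catalog.get(match.group(1))
```

The match on the public identifier is exact, so a DOCTYPE whose
identifier differs from every listed entry by even one word will not
resolve.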


----- Begin Included Message -----

> This is what sgmls set at recommended (strict) has to say about
> :
> sgmls: Error at -, line 1 in declaration parameter 4:
> Could not find external document type "HTML"

> What am I missing here? Certainly a return like the above would cause
> me to rework the document.

Well, it says that it couldn't find the DTD for HTML. I would expect
pretty much everything to fail after that. Since I did something
unusual for the WWW, and included the <!DOCTYPE to fetch the HTML
public DTD, I have to assume your sgml system isn't configured
properly (or mine isn't).

> As far as I can tell, most documents on the Web can't pass the current
> DTD, not to mention when sgmls is set to ``recommended.'' And this
> certainly goes for documents prepared using an HTML authoring editor.

There are two issues involved here. One is that a web browser fetches
only one entity. The only way for that single entity to validate under
SGML is for it to include the HTML declaration and DTD, which you
really don't want to include in every document you send over the wire
(the DTD is noticeably larger than most HTML pages).

The other is that the document type is known to be HTML, so many
people don't even bother with the DOCTYPE statement.

My solution is a Rexx script that checks for the DOCTYPE, adds it if
it isn't there, and feeds the HTML declaration to sgmls before the
document. Basically, I treat the HTML document as an entity that's
expected to include only the instance. There might be a better way to
do this, but it does work.
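
The same idea can be sketched in Python rather than Rexx. This is not
the script described above, only an illustration of the approach. The
function names, the default DOCTYPE string, and the choice to pipe the
combined text to the parser's standard input are all assumptions.

```python
import re
import subprocess

# Assumed default; the post does not say which DOCTYPE the script adds.
DEFAULT_DOCTYPE = '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">'

# Only a DOCTYPE at the very start of the document (after optional
# whitespace) is recognized in this sketch.
DOCTYPE_RE = re.compile(r'^\s*<!DOCTYPE\b', re.IGNORECASE)


def ensure_doctype(document, doctype=DEFAULT_DOCTYPE):
    """Return the document unchanged if it starts with a DOCTYPE,
    otherwise return it with the given DOCTYPE line prepended."""
    if DOCTYPE_RE.match(document):
        return document
    return doctype + "\n" + document


def build_parser_input(sgml_declaration, document, doctype=DEFAULT_DOCTYPE):
    """Place the SGML declaration text ahead of the document, so the
    HTML file itself only has to carry the document instance."""
    return sgml_declaration.rstrip("\n") + "\n" + ensure_doctype(document, doctype)


def validate(sgml_declaration, document, command=("sgmls",)):
    """Pipe the combined text to the parser on standard input and return
    (exit status, error output). The command name is an assumption and
    must match whatever parser is installed locally."""
    result = subprocess.run(
        list(command),
        input=build_parser_input(sgml_declaration, document),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr
```

Calling validate() with the contents of the local HTML declaration file
and a fetched page returns whatever the parser reports for that page.
The DTD itself still has to be resolvable by the parser through its own
configuration.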

BTW, I realized a word was left out of the original post; the first
line should have been:

As if it were reasonable to expect a browser to swallow all legal HTML.


----- End Included Message -----